Authors' Abstract

A frequent dilemma in programming language design is the choice between a 
language with a rich set of notations and a small, simple core language. We 
address this dilemma by proposing extensible grammars, a syntax-definition 
formalism for incremental language extensions and restrictions. 

The translation of programs written in rich object languages into a small 
core language is defined via syntax-directed patterns. In contrast to macro- 
expansion and program-rewriting tools, our extensible grammars respect scoping 
rules. Therefore, we can introduce binding constructs while avoiding problems 
with unwanted name clashes. 

We develop extensible grammars and illustrate their use by extending the 
lambda calculus with let-bindings, conditionals, and constructs from database 
programming languages, such as SQL query expressions. We then give a formal 
description of the underlying rules for parsing, transformation, and substitution. 
Finally, we sketch how these rules are exploited in an implementation of a 
generic, extensible parser package. 

1 Introduction



A frequent dilemma in programming language design is the choice between a 
user-friendly language with a rich set of notations and a small, conceptually 
simple core language. We address this dilemma by introducing extensible gram- 
mars, a syntax-definition formalism for incremental, problem-specific language 
extensions and restrictions. 

The translation of programs written in rich object languages into a small 
core language is defined via syntax-directed patterns. The translation resembles 
macro expansion, with some essential differences. Traditional macro-expansion 
and program-rewriting tools attempt to manipulate programs as mere strings 
or trees. This is the source of many of their well-known defects. In contrast, our 
extensible grammars recognize and respect the scoping structure of programs. 

The features of our approach are as follows: 

• Lexical scoping is strictly preserved. Therefore, we can introduce new 
binding constructs like quantifiers, iterators, and type declarations, while 
avoiding problems with unwanted name clashes ("variable captures").

• Parsing remains independent of type checking and evaluation. It always 
terminates. 

• We can determine, statically, what is the legal syntax in any region of the 
text of a program. 

• We can freely introduce new notation and mix it with existing notation 
without special quotations, antiquotations, or explicit macro calls. 

• New notation can be defined in terms of old notation, incrementally. 

• Our syntax-definition package is language-independent. 

The form of extensible grammars discussed in this paper was invented during 
the implementation of a polymorphically typed lambda calculus [Car93]. Here, 
we develop extensible grammars in a more general context and describe them 
in more detail. 

We motivate and illustrate the use of extensible grammars with examples 
from various domains, but we emphasize the application of extensible grammars 
for database programming. Current database systems typically rely on macro 
preprocessors in order to embed query notations in host languages like C or 
Cobol. Our extensible grammars may serve as a safe alternative to macros in 
this context. 

Both syntax extensions and syntax restrictions occur commonly in practice, 
and extensible grammars are designed to support them both. 

Syntax extensions provide syntactic sugar for problem-specific abstrac- 
tion. Syntax extensions have long been used in Lisp systems; recent work has 
focused on avoiding variable captures (see section 6). Notational definitions
make sense not only in programming but also in mathematics, in particular in 
logical frameworks [Gri88]. 

Syntax extensions have a variety of applications in database programming. 
For example, embedded query notations like the relational calculus, the rela- 
tional algebra, iteration statements, or set comprehensions can be introduced 
as abstractions defined from primitive iteration constructs [OBBT89, BTBN91, 
Tri91, MS91]. Transactions can be introduced as stylized patterns for side-effect 
control and exception handling. Similarly, structured form definitions in user 
interface code can be represented as abstractions over low-level routines for data 
formatting, input, and validation. At the type level, data modeling constructs 
like classes, objects, and binary relationships can be viewed as syntactic sugar 
for more complex type expressions involving recursive types, record types,
function types, or abstract data types [SSS+92, SSS88, PT93].

Syntax restrictions introduce intentional limitations on the expressive- 
ness or orthogonality of a core language. One rationale behind restrictions is to 
facilitate meta-level reasoning and optimizations tailored to a particular appli- 
cation domain. In addition, syntax restrictions can serve to enforce the use of 
subsets of languages. For instance, a syntax restriction may forbid imperative 
programming in student projects. 

While ad-hoc syntax restrictions are generally considered harmful in pro- 
gramming language design (from a pragmatic and a semantic perspective), 
they are common practice in database models and languages. For example, 
many schema definition languages disallow nested declarations (nested sets, 
nested classes) or limit recursive declarations to top-level class or type def- 
initions. Furthermore, user-defined types frequently do not have first-class 
status, and in particular they may not appear as arguments to collection- 
type constructors. Similarly, query languages typically impose restrictions to 
rule out side-effecting operations or calls to user-defined functions in selection 
and join predicates [SQL87]. Some query languages require static bindings to 
function identifiers (disallowing higher-order functions or dynamic method dis- 
patch) [SFL83], and some disallow lambda abstractions within quantified ex- 
pressions [BTBN91]. Finally, recursive queries or views are often subject to 
stratification constraints [Naq89]. 

The next section gives an overview of the issues that must be addressed by 
a formalism for language extensions and restrictions. In section 3 we introduce 
extensible grammars by examples. An initial grammar for the lambda calcu- 
lus is extended incrementally with new syntactic forms such as let-bindings, 
conditionals, and query notations. In section 4 we define the static type rules 
for grammar definitions and the semantics of parsers generated from extensible 
grammars. We also present a soundness result for the type system with re- 
spect to the evaluation semantics. In section 5 we describe the implementation 
of an extensible parser module for the Tycoon database environment [Mat93]. 
Finally, section 6 is a comparison with other approaches to syntax extension. 

[Figure 1 diagram omitted; surviving label: TL Type Checker & Code Generator]

Figure 1: The syntax-definition scenario

2 Overview 

The syntax extension formalism described in this paper assumes the scenario 
depicted in figure 1. Given the abstract syntax and the scoping structure of 
a target language TL, a new object language OL_0 can be defined by giving
its context-free grammar and the rewrite rules that map OL_0 terms into TL
terms. The mapping also defines the scoping structure of OL_0. Our formalism
is incremental since it also allows the definition of an object language OL_n by
a translation (rewriting) into another object language OL_{n-1}.

For example, assuming TL to be a functional language, the object language 
OL_0 could have either a Lisp-like list notation or an Algol-like keyword-based
notation: 

(defn succ(x) (plus x 1)) 

function succ (x) ; begin return plus(x, 1) end 

Both syntactic forms translate into the same abstract syntax tree in the target 
language TL that is passed to the TL type checker and code generator: 

Bind(succ Abs(x App(App(plus x) 1))) 

Subsection 3.1 gives a complete example of the target-language and the object- 
language definition for an untyped lambda calculus. 

A simple example of an incremental syntax definition is the definition of a
language with infix function application (OL_1) as an extension of a language
with only prefix application (OL_0). The notation A =>• B is used to indicate
that the input A in an extended language is equivalent to the input B in a
non-extended language:

function succ(x);                      function succ(x);
begin return x + 1 end       =>•       begin return plus(x, 1) end

In a database programming setting, OL_n could be a language with SQL-like
query notations that is translated into a lambda calculus, OL_{n-1}, with
primitive operations on a collection type (nil, cons, iter) [Tri91]:

select x.a                   iter(X)(nil)(fun(x)fun(z)
from x in X        =>•         if p(x) then cons(x.a)(z) else z)
where p(x)

Incremental grammar definitions are discussed in more detail in subsections 3.2
and 3.3. The definition of an SQL-like grammar in our formalism is given in
subsection 3.4.

Extensible grammars require extensible parsers. That is, a parser has to be
dynamically extensible to handle programmer-defined object languages. New
grammar definitions should be checked to avoid problems typical of macro
definitions [KR77], such as grammar ambiguity, non-termination of macro
expansion, and generation of illegal syntax trees. Our checking is done at
grammar-definition time and includes standard grammar analysis [ASU87] to avoid
the first two problems. To address the third problem, we develop a typing
discipline on productions (see subsection 4.1).

A more subtle source of difficulties associated with incremental grammar
definition is the binding structure of the target language. The rewriting of
object-language expressions into target-language expressions must be sensitive
to the scoping rules of the target language and may require renaming operations
to avoid name clashes ("variable captures"). A small example using C and the
C preprocessor illustrates the issue in a familiar setting:

#define swap(x,y) {int z; z = x; x = y; y = z;}

{int a, b; swap(a,b);} /* ok */

{int z, y; swap(z,y);} /* name clash */

The expansion of swap(z,y) leads to the program fragment {int z; z = z;
z = y; y = z;}, where the local declaration of z hides the variable z that is
passed as an argument to the macro. Removing the curly brackets in the macro
definition does not solve the problem, but causes a name clash between two
declarations of the variable z in the same scope.

In order to solve the scoping problems caused by rewriting inside binding
structures, a formalization of the scoping rules of the target language is
required. To adapt our grammar formalism easily to several target languages,
we divide the scoping problem into a generic bookkeeping task for the
extensible parser and a parameterized language-specific renaming operation. This
conceptual division of labor is exploited in the implementation of the
extensible grammar package to factor out target-language dependencies. Scoping
problems are avoided by distinguishing between binding and applied identifier 
occurrences, and by renaming when name clashes between identifiers in input 
programs and identifiers in rewrite rules could occur. Note that this solution is 
not an option for a simple token-based preprocessor. Subsection 4.2 describes 
the parsing and renaming rules of our formalism (for initial as well as incremen- 
tal grammar definitions). We are also able to prove that these dynamic parse 
rules are consistent with the static type rules given in subsection 4.1. 
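
The difference between token-based expansion and expansion that renames binders can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's mechanism: the token-list representation of a swap-style macro, the helper names naive_expand and hygienic_expand, and the z_N naming scheme are all assumptions, and the sketch further assumes that no user variable is already called z_N.

```python
import itertools

# Body of a swap-style macro as a token list; "z" is the macro's local binder.
SWAP_BODY = ["{", "int", "z", ";", "z", "=", "x", ";",
             "x", "=", "y", ";", "y", "=", "z", ";", "}"]
SWAP_PARAMS = ("x", "y")
SWAP_LOCALS = ("z",)

def naive_expand(args):
    """Token-based expansion: parameters are replaced, locals are left alone."""
    env = dict(zip(SWAP_PARAMS, args))
    return " ".join(env.get(tok, tok) for tok in SWAP_BODY)

_counter = itertools.count(1)

def hygienic_expand(args):
    """Give every macro-local binder a fresh name before substituting."""
    env = dict(zip(SWAP_PARAMS, args))
    for local in SWAP_LOCALS:
        env[local] = f"{local}_{next(_counter)}"  # assumed not to clash with user names
    return " ".join(env.get(tok, tok) for tok in SWAP_BODY)
```

With the arguments ("z", "y"), naive_expand produces a body in which the argument z is hidden by the local declaration, whereas hygienic_expand keeps the two apart. Doing this renaming correctly requires knowing which tokens are binders, which is exactly the information a token-based preprocessor lacks.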

3 Grammar Definitions 

In this section we introduce our extensible grammar formalism by examples. 
We start with a small initial grammar for an untyped lambda calculus that is 
extended incrementally to support database programming language constructs. 

3.1 Initial Grammar Definitions 

This subsection explains how to define the abstract syntax and the scoping rules 
of a particular target language TL as well as the syntax for an initial object
language OL_0 (see the oval boxes in figure 1). This information is validated
by the grammar checker and then used to generate an initial parser for OL_0
programs. 

We use an untyped lambda calculus with records as the target language for 
our examples. Given a set of identifiers x, the sets of terms (a, b) and
fields (f) are recursively defined as follows:

a, b ::= x | λx.a | a(b) | {f} | a.x
f ::= ∅ | x=a f

The first step in the definition of an extensible grammar is to define the 
names of the sorts and the signatures of the constructors available for the 
construction of target-language terms. Our example uses the following sorts, 
specific to the target language: 

Term      terms of the lambda calculus
Fields    ordered associations between field names and terms

Since identifiers require particular attention during expression rewriting, three
predefined sorts exist to distinguish the binding properties of identifiers:

Binder    identifiers appearing in binding positions
Var       identifiers appearing in the scope of a binder
Label     identifiers that are not subject to scoping

These sort names appear in the signatures of the term constructors for the
lambda calculus:

mkTermVar(x:Var):Term
mkTermFun(x:Binder a:Term):Term
mkTermApp(a:Term b:Term):Term
mkTermRcd(f:Fields):Term
mkTermDot(a:Term x:Label):Term
mkFieldNil():Fields
mkFieldCons(x:Label a:Term f:Fields):Fields

Lambda abstractions (mkTermFun) introduce identifiers in binding positions, 
while other identifiers inside terms (mkTermVar) appear in non-binding posi- 
tions. In our example, field labels (mkTermDot, mkFieldCons) are not subject 
to block-structured scoping rules and are therefore defined to be of sort Label. 
For the purpose of grammar definitions it is not necessary to present the binding 
rules of the target language in more detail. 

Given a target-language description in terms of constructors and sorts, a 
context-free grammar is defined as a collection of productions that translate 
phrases in an input stream into terms of the target language. A concrete syntax 
for the lambda calculus with records is defined in figure 2. The notation used 
is explained in the rest of this subsection. 

This grammar consists of four mutually recursive productions that define 
left-associativity of applications and precedence of applications over abstrac- 
tions. Here are examples of input phrases parsed according to the root produc- 
tion term: 

peter         mkTermVar(peter)
peter.age     mkTermDot(mkTermVar(peter) age)
fun(p)p(b)    mkTermFun(p mkTermApp(mkTermVar(p) mkTermVar(b)))

The result of parsing is a structured term of the target language. This term 
can be viewed as a tree in which the inner nodes correspond to term constructor 
applications and the leaves correspond to identifiers (or literals) extracted from 
the source text. A token sequence to which no production applies is rejected by 
the parser with an error message. 

A grammar introduces a set of non-terminals (simpleTerm, term, . . .) as 
identifiers for productions. Productions can be parameterized by terms of the 
target language (see, e.g., termIter). The signature of a non-terminal defines
its parameter names and sorts as well as the sort of terms returned by the
production. For example, the production termIter takes a parameter a of sort
Term and returns a term of sort Term.

The body of each production consists of n ≥ 1 expression sequences separated
from each other by a vertical bar (|). Each expression specifies an input
syntax and a result expression (following the => symbol) to construct a term of 
the target language. Based on the token sequence encountered during parsing, 
grammar
  simpleTerm:Term ==
      x=ide                          => mkTermVar(x)
    | "(" a=term ")"                 => a
    | "fun" "(" x=ide ")" a=term     => mkTermFun(x a)
    | "{" f=fields "}"               => mkTermRcd(f)
    | a=pIde:Term                    => a
  fields:Fields ==
      x=ide "=" a=term f=fields      => mkFieldCons(x a f)
    |                                => mkFieldNil()
    | f=pIde:Fields                  => f
  term:Term ==
      a=simpleTerm b=termIter(a)     => b
  termIter(a:Term):Term ==
      "(" b=term ")"                 => termIter(mkTermApp(a b))
    | "." x=ide                      => termIter(mkTermDot(a x))
    |                                => a
end

Figure 2: Definition of a concrete syntax for the lambda calculus




one of the alternative expression sequences is selected and its corresponding 
result expression is evaluated in an environment that contains the actual pa- 
rameter bindings and local bindings introduced on the left of the => symbol. 

The input syntax accepted by an alternative is defined using the following
notation:

"x"       accept the keyword x
ide       accept any non-keyword identifier
x         accept the input specified by the production identified by the
          non-terminal x
x(y)      accept the input specified by the parameterized production
          identified by the non-terminal x with the argument y
x=y       bind the term defined by y to a local variable x
pIde:S    accept a pattern variable of sort S (see subsection 3.3)

Each grammar determines a set of keywords reachable from productions of 
the grammar. The set of identifiers accepted by ide in a given grammar g 
excludes the keywords of g. Therefore, syntax extensions may introduce new 
keywords while syntax restrictions may change existing keywords into identifiers.

The binding structure of the concrete syntax is defined implicitly by passing 
identifier tokens from the input as arguments to term constructors. For example, 
the variable x in the grammar definition 

"fun" "(" x=ide ")" a=term => mkTermFun(x a) 

appears in a Binder position of the term constructor mkTermFun. Therefore, it 
can be deduced that the variable person in the source text fun(person) . . . 
appears in a binding position. 

The recursive production fields in figure 2 generates right-associative syntax
trees for field lists while the production termIter generates left-associative
syntax trees for function applications. Because we use an LL(1) parser,
left-associative grammars are handled in our grammar formalism by passing the
syntax tree for the left context of a phrase as a production argument for the
recursive invocation of a production (e.g., a:Term in production termIter in
figure 2).
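
To make the left-context technique concrete, the grammar of figure 2 can be transcribed by hand into a recursive-descent parser. The following Python sketch is an assumption-laden illustration, not the generated parser described in the paper: terms are represented as tuples tagged with the constructor name, the pIde alternatives for pattern variables are omitted, the tokenizer is a simple regular expression, and the names Parser, tokenize, and parse are hypothetical.

```python
import re

KEYWORDS = {"fun", "(", ")", "{", "}", ".", "="}

def tokenize(text):
    # Identifiers or single punctuation tokens; other characters are ignored here.
    return re.findall(r"[A-Za-z_]\w*|[(){}.=]", text)

class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {self.peek()!r}")
        self.pos += 1

    def ide(self):
        tok = self.peek()
        if tok is None or tok in KEYWORDS:
            raise SyntaxError(f"identifier expected, found {tok!r}")
        self.pos += 1
        return tok

    def simple_term(self):
        tok = self.peek()
        if tok == "(":
            self.expect("(")
            a = self.term()
            self.expect(")")
            return a
        if tok == "fun":
            self.expect("fun")
            self.expect("(")
            x = self.ide()
            self.expect(")")
            return ("mkTermFun", x, self.term())
        if tok == "{":
            self.expect("{")
            f = self.fields()
            self.expect("}")
            return ("mkTermRcd", f)
        return ("mkTermVar", self.ide())

    def fields(self):
        if self.peek() == "}":          # empty alternative
            return ("mkFieldNil",)
        x = self.ide()
        self.expect("=")
        a = self.term()
        return ("mkFieldCons", x, a, self.fields())   # right-associative

    def term(self):
        return self.term_iter(self.simple_term())

    def term_iter(self, a):
        # 'a' carries the syntax tree of the left context, so the result
        # grows to the left even though the parser only looks ahead one token.
        if self.peek() == "(":
            self.expect("(")
            b = self.term()
            self.expect(")")
            return self.term_iter(("mkTermApp", a, b))
        if self.peek() == ".":
            self.expect(".")
            return self.term_iter(("mkTermDot", a, self.ide()))
        return a

def parse(text):
    p = Parser(text)
    t = p.term()
    if p.peek() is not None:
        raise SyntaxError(f"unexpected token {p.peek()!r}")
    return t
```

Calling parse("f(a)(b)") nests the application of f to a inside the outer application, which is the left-associative shape the termIter parameter is meant to produce.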



3.2 Incremental Grammar Definitions 

This subsection explains how to define the syntax of a new object language OL_n
as an extension or a restriction of an existing object language OL_{n-1}. Such a
syntax redefinition is validated by the grammar checker and used to derive a
parser for OL_n from an existing parser for OL_{n-1}.

A grammar defines a mapping from non-terminals (e.g., simpleTerm, term) 
to variables that are initialized with productions. Inside a production, each 
non-terminal denotes the production identified by its variable. Three incremen- 
tal grammar operations are available: addition, extension, and update. The 
rationale behind these operations is to allow the update and re-use of existing 
non-terminal definitions, preserving the recursive structure of the grammar. 

A grammar addition (==) defines a mapping from a non-terminal to a newly 
created variable initialized with a production. For example, we could use the 
standard encoding of let bindings: 

let x=a in b =>• (fun(x) b)(a) 

to add the new non-terminal topLevel: 

grammar
  topLevel:Term ==
      a=term                       => a
    | "let" x=ide "=" a=term
      "in" b=topLevel              => mkTermApp(mkTermFun(x b) a)
end

The non-terminal topLevel is mapped to a newly created variable initialized
with a production that accepts terms of the base language and (nested) let 
bindings at the top level, but not inside terms. 

A grammar extension (|==) destructively updates the variable identified by 
a non-terminal with a new production. The new production extends the old 
production with additional alternatives. For example, to extend simpleTerm, 
we could write: 



grammar
  simpleTerm:Term |==
      "unit"                       => mkTermRcd(mkFieldNil())
    | "let" x=ide "=" a=term
      "in" b=term                  => mkTermApp(mkTermFun(x b) a)
end



This grammar extension affects all productions referring to term, allowing unit 
and nested let bindings within terms. 

A grammar update (:==) destructively updates the contents of a variable
identified by a non-terminal with a new production that has the same signature, 
thereby affecting all productions referring to that non-terminal. For example, 
the definition of term could be updated as follows: 



grammar
  term:Term :==
      x=ide                        => mkTermVar(x)
    | "(" a=term b=term ")"        => mkTermApp(a b)
    | "{" f=fields "}"             => mkTermRcd(f)
end



This redefinition affects all productions referring to term (simpleTerm, fields,
termIter), thereby restricting the expressiveness of the original language by
disallowing abstractions.
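
The distinction between the three operations comes down to whether a non-terminal is rebound to a new variable or the existing variable is overwritten. A minimal Python sketch of that distinction follows. It is not the paper's implementation: the classes Cell and Grammar, the list-of-symbols encoding of alternatives (result expressions are left out), and the helper reachable_keywords are assumptions made for illustration.

```python
class Cell:
    """A mutable variable holding the alternatives of one production."""
    def __init__(self):
        self.alternatives = []

class Grammar:
    def __init__(self):
        self.cells = {}                       # non-terminal name -> Cell

    def _resolve(self, alternatives):
        # Replace non-terminal names by the cells they currently denote.
        return [[self.cells.get(s, s) for s in alt] for alt in alternatives]

    def add(self, block):
        """'==': bind every name of the block to a newly created cell."""
        for name in block:
            self.cells[name] = Cell()
        for name, alts in block.items():
            self.cells[name].alternatives = self._resolve(alts)

    def extend(self, name, alts):
        """'|==': append alternatives to the existing cell."""
        self.cells[name].alternatives += self._resolve(alts)

    def update(self, name, alts):
        """':==': overwrite the contents of the existing cell."""
        self.cells[name].alternatives = self._resolve(alts)

def reachable_keywords(cell, seen=None):
    """Keywords (quoted symbols) reachable from a production."""
    seen = set() if seen is None else seen
    if id(cell) in seen:
        return set()
    seen.add(id(cell))
    found = set()
    for alt in cell.alternatives:
        for sym in alt:
            if isinstance(sym, Cell):
                found |= reachable_keywords(sym, seen)
            elif sym.startswith('"'):
                found.add(sym.strip('"'))
    return found

g = Grammar()
g.add({"simpleTerm": [["ide"], ['"fun"', "ide", "term"]],
       "term": [["simpleTerm"]]})
term = g.cells["term"]
```

Because term holds a reference to the cell of simpleTerm, a later g.extend("simpleTerm", ...) or g.update("simpleTerm", ...) is visible through term, while a second g.add({"simpleTerm": ...}) creates a fresh cell that existing productions do not see.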



3.3 Pattern-based Action Definitions 

In subsection 3.2, abstract syntax trees produced by actions are specified with 
explicit constructor applications. In this subsection we introduce patterns which 
allow us to write grammars more conveniently by using the existing target lan- 
guage. For example, the syntax for let and where bindings could be written 
more clearly using a pattern: 

grammar
  simpleTerm:Term |==
      "let" x=ide "=" a=term
      "in" b=term                  => term<<(fun(x) b)(a)>>
end

Inside the pattern term<<(fun(x) b)(a)>>, the variables x, a, and b, introduced
on the left-hand side of the production, act as placeholders (pattern
variables) of sort Binder, Term, and Term, respectively. A pattern p<<s>> in a
grammar g is translated into constructor applications by parsing the input token
stream s starting with the production p. For example, when the token stream
(fun(y) b)(a) is parsed as a term, the pattern term<<(fun(y) b)(a)>> yields
the nested constructor application mkTermApp(mkTermFun(y b) a). Note that
the concrete syntax and the binding structure inside the <<>> brackets is defined
by the grammar that is valid before the enclosing grammar end block.

The keyword pIde followed by a sort identifier is used in the initial grammar
definition (see subsection 3.1) to indicate those positions in the input syntax
where pattern variables may appear. For example, f is a pattern variable of
sort Fields in the pattern <<{f}>>. Pattern variables of the sorts Binder,
Var, and Label may appear also at those places in the input syntax where the
keyword ide is used to accept identifier tokens of the appropriate sort.

To avoid variable captures and name clashes, many pattern-based syntax
extensions require the introduction of fresh identifiers, that is, identifiers distinct
from other identifiers appearing in Binder and Var positions. For example, the
syntax for functional composition (f*g) could be defined as:

grammar
  termIter(a:Term):Term |==
      "*" b=term x=local           => termIter(term<<fun(x)a(b(x))>>)
end

The notation x=local guarantees that a fresh identifier is bound to x for every
instantiation of this production during parsing. For example, f*g*h is
expanded to fun(x2)(f(fun(x1)g(h(x1)))(x2)), and x*y is expanded to
fun(x1)(x(y(x1))), avoiding a variable capture of the input variable x by a
binder introduced in the pattern.
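
The effect of x=local can be imitated with a name generator that is consulted once per instantiation of the production. The Python sketch below is illustrative only: the tuple representation of terms, the class FreshNames, and the function compose are assumptions, and freshness is obtained by the simplifying assumption that user programs never contain identifiers of the form x1, x2, and so on.

```python
import itertools

class FreshNames:
    """Source of identifiers assumed not to occur in user programs."""
    def __init__(self, prefix="x"):
        self.prefix = prefix
        self.counter = itertools.count(1)

    def local(self):
        return f"{self.prefix}{next(self.counter)}"

def var(x):
    return ("mkTermVar", x)

def app(a, b):
    return ("mkTermApp", a, b)

def fun(x, a):
    return ("mkTermFun", x, a)

def compose(a, b, fresh):
    """Action of the '*' alternative: build fun(x) a(b(x)) with x = local."""
    x = fresh.local()          # a new binder for every instantiation
    return fun(x, app(a, app(b, var(x))))
```

Composing the input variables x and y this way yields a binder x1 that is distinct from the user's x, so the argument x is not captured. Nested compositions draw a different binder at each level.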

Since grammar definitions can be interspersed with object-language expres- 
sions, it is desirable to allow patterns to contain variables that refer to global 
bindings. For example, the boolean constants true and false are sometimes 
represented by the following functions which, when applied to two arguments, 
return one of them: 

let T = fun(x)fun(y)x 
let F = fun(x)fun(y)y 

In the scope of these definitions, the following grammar could be defined to 
replace the keywords true and false by the variables T and F, respectively. 

grammar
  simpleTerm:Term |==
      "true"                       => term<<T>>
    | "false"                      => term<<F>>
    | "if" a=term "then" b=term
      "else" c=term                => term<<a(b)(c)>>
end



During expansion of a pattern with free variables (T and F in the example 
above), unwanted variable captures must be avoided. For example, a naive 
macro expansion of the term fun(T) T(true) would yield the term fun(T) 
T(T) where the expansion of the keyword true is bound incorrectly. Therefore, 
free variables in extensible grammars are handled as follows: Each occurrence 
of a free variable x in a grammar definition is replaced by a fresh variable 
x'. During parsing, these modified patterns generate expansions that contain 
unbound variables (T' and F'). For example, T(fun(T) T(true)) is expanded
to T(fun(T) T(T')). After the full input has been parsed, a renaming function
is applied to the parsed term. The renaming function depends on the target
language. In this case, it replaces the binder T and its bound variables by T'',
and T' by T. The resulting term T(fun(T'') T''(T)) is then submitted to the
type checker and code generator.
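
For the lambda calculus without records, a renaming function of this kind can be sketched as a single traversal. The Python code below is one possible reading of the description above, not the paper's definition (the formal rules are the subject of subsection 4.2): terms are tagged tuples, a variable introduced by a pattern is marked by a trailing prime, user identifiers are assumed to contain no primes, and the helper names marked_globals and rename are hypothetical.

```python
def marked_globals(term):
    """Base names of all marked variables (written x') occurring in term."""
    tag = term[0]
    if tag == "mkTermVar":
        return {term[1][:-1]} if term[1].endswith("'") else set()
    if tag == "mkTermFun":
        return marked_globals(term[2])
    return marked_globals(term[1]) | marked_globals(term[2])   # mkTermApp

def rename(term, env=None):
    """Rename binders that would capture a marked variable, then unmark it."""
    env = {} if env is None else env
    tag = term[0]
    if tag == "mkTermVar":
        x = term[1]
        if x.endswith("'"):                 # pattern-introduced global: x' -> x
            return ("mkTermVar", x[:-1])
        return ("mkTermVar", env.get(x, x)) # follow the enclosing binder, if any
    if tag == "mkTermFun":
        x, body = term[1], term[2]
        # A binder x is renamed to x'' only if its body mentions the global x'.
        new = x + "''" if x in marked_globals(body) else x
        return ("mkTermFun", new, rename(body, {**env, x: new}))
    return ("mkTermApp", rename(term[1], env), rename(term[2], env))
```

Applied to the tuple form of T(fun(T) T(T')), the function returns the tuple form of T(fun(T'') T''(T)): the outer T is untouched, the binder and its bound occurrence are renamed, and the marked variable reverts to the global T.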

3.4 Further Examples: Query Notations 

In this subsection we show how some typical database query notations can be 
viewed as mere "syntactic sugar" for the application of a single higher-order 
iterator function. The reduction of query notations into a single canonical 
iteration construct has been exploited in the literature to simplify the type 
checking of database programming languages [OBBT89], the code generation 
for query expressions [Tri91], and the verification of functional database pro- 
grams [SS91, SSS88]. The following examples demonstrate that extensible gram- 
mars provide sufficient expressive power to define the syntax of typical database 
query languages as well as their translation into lambda calculus. This transla- 
tion preserves the usual scoping rules defined for these query languages. 

We assume the grammar extension for booleans defined above, and the fol- 
lowing global definitions that provide a standard encoding of the list construc- 
tors nil and cons and of the list iterator iter: 

let nil = fun(n)fun(c) n
let cons = fun(hd)fun(tl)fun(n)fun(c) c(hd)(tl(n)(c))
let iter = fun(l)fun(n)fun(c) l(n)(c)

The syntax of a "list algebra" with selection, projection, and binary join can 
then be defined as follows: 

grammar
  simpleTerm:Term |==
      "select" x=ide "in" a=term "where" b=term y=local
        => term<<iter(a)(nil)(fun(x)fun(y)if b then cons(x)(y) else y)>>
    | "project" x=ide "in" a=term "onto" f=fieldList(x) y=local
        => term<<iter(a)(nil)(fun(x)fun(y)cons({f})(y))>>
    | "join" x=ide "in" a=term "," y=ide "in" b=term
      "where" c=term x2=local y2=local
        => term<<iter(a)(nil)(fun(x)fun(x2)iter(b)(x2)(fun(y)fun(y2)
             if c then cons({fst=x snd=y})(y2) else y2))>>
  fieldList(x:Var):Fields ==
      y=ide "," f=fieldList(x)     => fields<<y=x.y f>>
    |                              => fields<<>>
end

For example, a selection expression with a variable identifier x, a range expres- 
sion a, and a selection predicate b is translated into an iterative loop. This loop 
over a has x as its loop variable. Starting with the empty list nil, the loop adds 
those elements that satisfy the selection predicate b: 

iter (a) (nil) (fun(x)fun(y)if b then cons(x)(y) else y) 

In this expression, y is a fresh local variable which is bound during iteration to 
the result of the previous iteration step. This translation correctly captures the 
scoping rules for the list algebra, since the variable x is visible only in b and not 
in a. Furthermore, global identifiers are visible in a and b. 
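
The selection loop can be tried out directly by transcribing the encoding into Python lambdas. This sketch makes several assumptions: it uses the two-argument encoding of nil (a function of n and c), it replaces the Church booleans by Python's native conditional, it names the iterator iter_ to avoid shadowing the Python builtin, and the helpers from_py, to_py, and select are hypothetical conveniences.

```python
# Lists encoded as their own right fold: a list applied to n and c
# returns c(hd1)(c(hd2)(... n ...)).
nil = lambda n: lambda c: n
cons = lambda hd: lambda tl: lambda n: lambda c: c(hd)(tl(n)(c))
iter_ = lambda l: lambda n: lambda c: l(n)(c)

def from_py(xs):
    """Build an encoded list from a Python list."""
    result = nil
    for x in reversed(xs):
        result = cons(x)(result)
    return result

def to_py(l):
    """Convert an encoded list back to a Python list."""
    return l([])(lambda hd: lambda rest: [hd] + rest)

def select(a, predicate):
    """iter(a)(nil)(fun(x)fun(y) if b then cons(x)(y) else y)"""
    return iter_(a)(nil)(lambda x: lambda y: cons(x)(y) if predicate(x) else y)
```

Here to_py(select(from_py([1, 2, 3, 4]), lambda x: x % 2 == 0)) evaluates to [2, 4]. The loop variable x is visible in the predicate only, and y receives the result accumulated so far, as described above.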

The parameterized production fieldList demonstrates how parameters
may be used to distribute terms (in this case a variable identifier x) into
multiple subterms. Using the extended grammar one can write, for example, the
following queries that use global identifiers Persons, thirty, and equal:

select p in Persons where greater(p.age)(thirty)
project p in Persons onto name, age
join p in Persons, s in Students where equal(p.name)(s.name)

Furthermore, it is possible to nest queries and to parameterize queries: 

fun(limit) select p in
    select p in Persons where greater(p.salary)(limit)
  where greater(p.age)(thirty)

Note that the identifier p in the subquery will be correctly bound to the inner 
p in the generated lambda term. 

Simulating SQL expressions is slightly more complicated, since SQL allows 
the repetition of range expressions to express selections, projections, and n-way 
joins using a uniform notation: 

select target(x) from x in a where predicate(x)

select target(x)(y) from x in a, y in b where predicate(x)(y)

select target(x)(y)(z) from x in a, y in b, z in c
    where predicate(x)(y)(z)

Therefore, the rewrite rules have to ensure that the target and the selection
expressions appear in the scope of n (n ≥ 1) fun binders in the generated
lambda term. The following grammar uses a recursive, parameterized production
rangeIter to achieve the desired rewriting:

grammar
  simpleTerm:Term |==
      "select" a=term "from" x=ide "in" b=term c=rangeIter(a)
        => term<<iter(b)(nil)(fun(x)c)>>
  rangeIter(a:Term):Term ==
      "," x=ide "in" b=term c=rangeIter(a) y=local
        => term<<fun(y)iter(b)(y)(fun(x)c)>>
    | "where" b=term y=local
        => term<<fun(y)if b then cons(a)(y) else y>>
end

For example, a two-way join would be expanded as follows: 

select {x.a y.b}                 iter(X)(nil)(fun(x)
from x in X, y in Y       =>•      fun(z1) iter(Y)(z1)(fun(y)
where p(x.c)(y.c)                    fun(z2) if p(x.c)(y.c) then
                                       cons({x.a y.b})(z2) else z2))
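
The recursion of rangeIter can be followed step by step with a small generator of expansion text. The Python sketch below only concatenates strings and does no parsing; the function name expand_select, the representation of ranges as (variable, collection) pairs, and the z1, z2 numbering of fresh names are assumptions, and user identifiers are assumed not to have the form zN.

```python
import itertools

def expand_select(target, ranges, predicate):
    """Mimic simpleTerm/rangeIter for: select target from ranges where predicate."""
    fresh = (f"z{i}" for i in itertools.count(1))

    def range_iter(rest):
        y = next(fresh)                       # y=local
        if not rest:                          # "where" alternative
            return f"fun({y}) if {predicate} then cons({target})({y}) else {y}"
        (x, b), tail = rest[0], rest[1:]      # "," alternative
        return f"fun({y}) iter({b})({y})(fun({x}) {range_iter(tail)})"

    (x, b), tail = ranges[0], ranges[1:]
    return f"iter({b})(nil)(fun({x}) {range_iter(tail)})"
```

Each additional range contributes one more fun binder around the target and the predicate, so both end up in the scope of all range variables. The order in which fresh names are numbered is immaterial.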

4 Formalizing Grammars and Parsers 

In subsection 4.1 we describe the rules that are used in the grammar checker 
(see figure 1) to statically decide whether a sequence of grammar definitions 
and grammar extensions is well-formed. In subsection 4.2 we formalize the 
parse rules that define the mapping from an input stream into a constructed 
term of the target language. We also present a soundness result of the dynamic 
parse rules with respect to the static type rules of subsection 4.1. This result 
guarantees that parsers derived from well-typed grammars return well-formed 
parse trees. In subsection 4.3 we generalize the result to parsers derived from 
incremental pattern-based grammar definitions. 

4.1 Static Typing of Grammar Definitions 

To describe the type rules for grammar definitions and extensions, we first de- 
fine the relevant syntactic objects (sorts, signatures, productions, grammars, 
grammar sequences) . 

The syntax for term sorts B and signatures S is defined as follows:

B ::= Unit | Var | Binder | Label    predefined term sorts
    | B_1 | ... | B_n                 sorts specific to the target language

S ::= (B_1, ..., B_k)B                production signatures (k ≥ 0)
The abstract syntax of productions is slightly more orthogonal than the concrete
syntax we have used in the examples. In particular, terminal productions
like ide(B) or "x" may appear nested within constructor and production
argument lists. Furthermore, the syntactic separation of productions into a
binding sequence and a constructor application (to the left and right of the
=>, respectively) is no longer enforced. For example, the production x=ide
=> mkTermVar(x) in the concrete syntax is translated into a simple sequential
composition x = ide(Var) mkTermVar(x).



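$$
\begin{array}{rcll}
p & ::= & \texttt{unit} & \text{unit production} \\
  & \mid & \texttt{"}x\texttt{"} & \text{keyword token production} \\
  & \mid & \texttt{ide}(B) & \text{variable token production (of sort } B\text{)} \\
  & \mid & \texttt{local} & \text{fresh object-language variable} \\
  & \mid & \texttt{global}(x) & \text{global object-language variable} \\
  & \mid & x & \text{term variable} \\
  & \mid & p_1\ p_2 & \text{sequential composition} \\
  & \mid & x = p_1\ p_2 & \text{pattern variable binding} \\
  & \mid & p_1 \mid p_2 & \text{choice} \\
  & \mid & x(p_1, \dots, p_k) & \text{non-terminal application } (k \ge 0) \\
  & \mid & c_{(B_1, \dots, B_k)B}(p_1, \dots, p_k) & \text{sorted constructor application } (k \ge 0)
\end{array}
$$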



The set of constructors $c_{(B_1, \dots, B_k)B}$ with argument sorts $B_i$ and result sort $B$
contains the constructors specific to the target language (e.g., mkTermVar,
mkTermFun).

A grammar consists of a list of non-terminal definitions that define a signa- 
ture, a modification operator, and a production. 

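$$
\begin{array}{rcll}
g & ::= & [\,] & \text{empty grammar} \\
  & \mid & g\ \ x : (x_1{:}B_1, \dots, x_k{:}B_k)B\ \ a\ \ p & \text{non-terminal definition} \\
a & ::= & \texttt{==} & \text{grammar addition} \\
  & \mid & \texttt{:==} & \text{grammar update} \\
  & \mid & \texttt{|==} & \text{grammar extension}
\end{array}
$$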

Each grammar (a block of possibly recursive definitions) is defined in the scope 
of its preceding grammar definitions: 

gseq ::= empty grammar sequence 

| gseq g grammar composition 

A global environment E assigns signatures to non-terminals: 

$$
\begin{array}{rcll}
E & ::= & \emptyset & \text{empty environment} \\
  & \mid & E, x : S & \text{non-terminal } x \text{ has signature } S
\end{array}
$$

A local environment L assigns signatures to term variables: 

$$
\begin{array}{rcll}
L & ::= & \emptyset & \text{empty environment} \\
  & \mid & L, x : B & \text{variable } x \text{ has sort } B
\end{array}
$$
Environments are ordered so that they model block-structured scoping.
Environment concatenation is written as $E, E'$. The domain of an environment,
denoted by $\mathrm{Dom}(E)$, is the set of variables $x$ defined in $E$. A variable name $x$
may occur more than once in an environment. In this case, the type rules for
variables retrieve the rightmost sort or signature assigned to $x$.

The static semantics of grammars involves the following four kinds of judg- 
ments, defined in the remainder of this subsection: 

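$$
\begin{array}{ll}
E; L \vdash p : B & \text{production } p \text{ has sort } B \text{ assuming } E \text{ and } L \\
E \vdash g :: E' & \text{grammar } g \text{ defines signatures } E' \text{ consistent with } E \\
E \vdash g\ \text{ok} & \text{grammar } g \text{ defines productions consistent with } E \\
\vdash \mathit{gseq} \Rightarrow E & \text{grammar sequence } \mathit{gseq} \text{ defines a final environment } E
\end{array}
$$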

The structure of the sort rules for productions resembles the structure of 
typing rules for terms in a simply typed lambda calculus: 

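$$
E; L \vdash \texttt{unit} : \mathit{Unit}
\qquad
E; L \vdash \texttt{"}x\texttt{"} : \mathit{Unit}
\qquad
E; L \vdash \texttt{ide}(B) : B
$$

$$
E; L \vdash \texttt{local} : \mathit{Binder}
\qquad
E; L \vdash \texttt{global}(x) : \mathit{Var}
\qquad
\dfrac{x \notin \mathrm{Dom}(L')}{E; L, x : B, L' \vdash x : B}
$$

$$
\dfrac{E; L \vdash p_1 : B \qquad E; L \vdash p_2 : B'}{E; L \vdash p_1\ p_2 : B'}
\qquad
\dfrac{E; L \vdash p_1 : B \qquad E; L, x : B \vdash p_2 : B'}{E; L \vdash x = p_1\ p_2 : B'}
$$

$$
\dfrac{E; L \vdash p_1 : B \qquad E; L \vdash p_2 : B}{E; L \vdash p_1 \mid p_2 : B}
\qquad
\dfrac{E; L \vdash p_i : B_i \quad (1 \le i \le k)}{E; L \vdash c_{(B_1, \dots, B_k)B}(p_1, \dots, p_k) : B}
$$

$$
\dfrac{E; L \vdash p_i : B_i \quad (1 \le i \le k) \qquad x \notin \mathrm{Dom}(E')}
      {E, x : (B_1, \dots, B_k)B, E'; L \vdash x(p_1, \dots, p_k) : B}
$$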
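The sort rules translate almost directly into a recursive checker. The following Python sketch is illustrative only: productions are encoded as tagged tuples (an encoding chosen here, not taken from the implementation), environments are ordered lists so that the rightmost entry for a name wins, and the constructor table C with its entry for mkTermVar is an assumption made for the example.

```python
class SortError(Exception):
    pass

def lookup(env, name):
    """Return the rightmost entry for name in an ordered list of (name, value) pairs."""
    for n, v in reversed(env):
        if n == name:
            return v
    raise SortError(f"unbound name {name!r}")

def sort_of(E, L, C, p):
    """Compute B such that E; L |- p : B, or raise SortError.

    E: list of (non-terminal, (argument sorts, result sort))
    L: list of (term variable, sort)
    C: dict mapping constructor names to (argument sorts, result sort)
    """
    tag = p[0]
    if tag in ("unit", "kw"):
        return "Unit"
    if tag == "ide":
        return p[1]
    if tag == "local":
        return "Binder"
    if tag == "global":
        return "Var"
    if tag == "var":
        return lookup(L, p[1])
    if tag == "seq":
        sort_of(E, L, C, p[1])  # p1 must be well-sorted; its sort is discarded
        return sort_of(E, L, C, p[2])
    if tag == "bind":
        _, x, p1, p2 = p
        return sort_of(E, L + [(x, sort_of(E, L, C, p1))], C, p2)
    if tag == "alt":
        b1, b2 = sort_of(E, L, C, p[1]), sort_of(E, L, C, p[2])
        if b1 != b2:
            raise SortError(f"choice branches have sorts {b1} and {b2}")
        return b1
    if tag in ("cons", "app"):
        _, name, args = p
        if tag == "cons":
            if name not in C:
                raise SortError(f"unknown constructor {name!r}")
            arg_sorts, result = C[name]
        else:
            arg_sorts, result = lookup(E, name)
        actual = [sort_of(E, L, C, a) for a in args]
        if actual != list(arg_sorts):
            raise SortError(f"{name}: expected {list(arg_sorts)}, got {actual}")
        return result
    raise SortError(f"unknown production {p!r}")

# Example: x = ide(Var) mkTermVar(x), assuming mkTermVar : (Var)Term.
C = {"mkTermVar": (("Var",), "Term")}
example = ("bind", "x", ("ide", "Var"), ("cons", "mkTermVar", [("var", "x")]))
```

With these assumptions, sort_of([], [], C, example) evaluates to "Term", whereas a choice between productions of different sorts is rejected.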



The type checking of a grammar $g$ is performed in two passes in order to
handle recursive non-terminal definitions correctly. A first pass ($E \vdash g :: E'$)
collects the signatures $E'$ of all non-terminals in $g$, verifies that each non-terminal
is defined at most once in $g$, and asserts that all grammar updates ($x : S \mathrel{\texttt{:==}} p$)
and grammar extensions ($x : S \mathrel{\texttt{|==}} p$) refer to non-terminals with matching
signatures in the scope $E$ of $g$:
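$$
E \vdash [\,] :: \emptyset
\qquad
\dfrac{E \vdash g :: E' \qquad x \notin \mathrm{Dom}(E')}
      {E \vdash g\ \ x : (x_1{:}B_1, \dots, x_k{:}B_k)B \mathrel{\texttt{==}} p \ ::\ E', x : (B_1, \dots, B_k)B}
$$

$$
\dfrac{E \vdash g :: E' \qquad x \notin \mathrm{Dom}(E') \qquad a \in \{\texttt{:==}, \texttt{|==}\} \qquad E \vdash x : (B_1, \dots, B_k)B}
      {E \vdash g\ \ x : (x_1{:}B_1, \dots, x_k{:}B_k)B\ \ a\ \ p \ ::\ E', x : (B_1, \dots, B_k)B}
$$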

In a second pass ($E \vdash g\ \text{ok}$), the bodies $p$ of all non-terminal definitions in $g$
are checked to match their signatures in $E$. The rules for parameterized
non-terminal definitions resemble the type rules for lambda abstractions:

$$
E \vdash [\,]\ \text{ok}
\qquad
\dfrac{E \vdash g\ \text{ok} \qquad E; \emptyset, x_1 : B_1, \dots, x_k : B_k \vdash p : B \qquad a \in \{\texttt{==}, \texttt{:==}, \texttt{|==}\}}
      {E \vdash g\ \ x : (x_1{:}B_1, \dots, x_k{:}B_k)B\ \ a\ \ p\ \ \text{ok}}
$$

A sequence of grammars is verified by performing the two passes above on each 
grammar in the sequence using the environment established by its preceding 
grammars: 

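$$
\dfrac{\vdash \mathit{gseq} \Rightarrow E \qquad E \vdash g :: E' \qquad E, E' \vdash g\ \text{ok}}
      {\vdash \mathit{gseq}\ g \Rightarrow E, E'}
$$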



It is possible to derive a simple consistency-checking algorithm from these
inference rules as follows: Starting with the proof goal $\vdash \mathit{gseq} \Rightarrow E$, the inference
rules have to be applied "backwards" (from the conclusions to the assumptions).
Since for each syntactic construct there is exactly one applicable inference rule, 
the derivation either reaches the axioms (in time proportional to the size of 
the grammar) or gets stuck in a configuration where no inference rule can be 
applied. In the latter case the grammar sequence is rejected as ill-typed. In the 
next subsection we prove that parsers derived from well-typed grammars never 
generate ill-formed syntax trees. 
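To make the backwards application of the rules concrete, the following Python sketch implements the first pass (signature collection) and the iteration over a grammar sequence. It is a simplified illustration, not the checker of figure 1: the tuple encoding of definitions is chosen here, and the second pass is passed in as a callback (check_bodies, a hypothetical name) that defaults to accepting every body.

```python
class GrammarError(Exception):
    pass

def lookup(env, name):
    """Rightmost signature bound to name in an ordered environment, or None."""
    for n, sig in reversed(env):
        if n == name:
            return sig
    return None

def collect_signatures(E, g):
    """First pass, E |- g :: E'.

    g is a list of definitions (name, params, result, op, body), where params
    is a list of (variable, sort) pairs and op is one of "==", ":==", "|==".
    """
    E_new = []
    for name, params, result, op, _body in g:
        sig = (tuple(sort for _, sort in params), result)
        if any(n == name for n, _ in E_new):
            raise GrammarError(f"{name} is defined twice in one grammar")
        if op in (":==", "|=="):
            # Updates and extensions must refer to a matching signature in E.
            if lookup(E, name) != sig:
                raise GrammarError(f"{name}: no matching definition in scope")
        elif op != "==":
            raise GrammarError(f"unknown operator {op!r}")
        E_new.append((name, sig))
    return E_new

def check_sequence(gseq, check_bodies=lambda E, g: None):
    """Derive |- gseq => E by running both passes on each grammar in turn."""
    E = []
    for g in gseq:
        E_new = collect_signatures(E, g)
        check_bodies(E + E_new, g)  # second pass, E, E' |- g ok (stubbed by default)
        E = E + E_new
    return E
```

Each definition is visited once per pass, which matches the observation that the derivation takes time proportional to the size of the grammar.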



4.2 Parsing and Term Construction 

Each non-terminal $x$ in a grammar serves a dual purpose. On the one hand,
it determines how to parse an input token stream and how to construct a
corresponding term of the target language. On the other hand, it defines how to
transform a pattern (a token stream inside <<>> brackets) occurring in an
incremental grammar definition into an equivalent production. In this subsection we
describe the parsing of input token streams, while pattern parsing is described
in subsection 4.3.

For the purpose of parsing it is convenient to rewrite a grammar sequence
$\mathit{gseq}$ into a single grammar $g$ of the form
$[\,],\ x_1 : S_1 \mathrel{\texttt{==}} p_1, \dots, x_k : S_k \mathrel{\texttt{==}} p_k$
($k \ge 0$) such that $x_i \ne x_j$ for $i \ne j$. We use the notation:

$$
\mathit{gseq} \leadsto g \qquad \text{grammar sequence } \mathit{gseq} \text{ normalizes to } g
$$



In this rewrite process, grammar updates ($x : S \mathrel{\texttt{:==}} p$) and grammar
extensions ($x : S \mathrel{\texttt{|==}} p$) are eliminated by changing their corresponding original
definitions ($x : S \mathrel{\texttt{==}} p'$) into $x : S \mathrel{\texttt{==}} p$ and $x : S \mathrel{\texttt{==}} p \mid p'$, respectively. Name
conflicts between grammar additions $x : S \mathrel{\texttt{==}} p$ and $x : S' \mathrel{\texttt{==}} p'$ ($p \ne p'$) in two
grammars of $\mathit{gseq}$ are resolved by consistently renaming one of the non-terminals
to a fresh non-terminal $x'$ within its local scope. It is easy to see that
normalization preserves typing, that is, if $\mathit{gseq} \leadsto g$ and $\vdash \mathit{gseq} \Rightarrow E$, then $\vdash g \Rightarrow E'$,
where $E'$ is equal to $E$ up to duplicate elimination.
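The elimination of updates and extensions can be sketched in a few lines of Python. The sketch assumes a well-typed grammar sequence, encodes definitions as tuples (name, signature, operator, production), and represents a choice as the tuple ("alt", p, p'). The renaming of clashing grammar additions is deliberately left out and reported as unsupported.

```python
def normalize(gseq):
    """Flatten a grammar sequence into a dict: name -> (signature, production)."""
    result = {}
    for g in gseq:
        for name, sig, op, p in g:
            if op == "==":
                if name in result:
                    # The text resolves such clashes by renaming; not sketched here.
                    raise NotImplementedError(f"clashing addition for {name}")
                result[name] = (sig, p)
            elif op == ":==":
                # Update: the new production replaces the old one.
                result[name] = (sig, p)
            elif op == "|==":
                # Extension: the new production becomes the first alternative.
                _, old = result[name]
                result[name] = (sig, ("alt", p, old))
            else:
                raise ValueError(f"unknown operator {op!r}")
    return result
```

For instance, a definition of term with production p0 followed by an extension with p1 normalizes to the single production ("alt", p1, p0).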

We use the following notation to describe how a production $p$ of a grammar $g$
applied to an input stream constructs a term $t$ of the target language:

$$
g; M \vdash (s, i)\ p \Rightarrow (s', i')\ t
$$

This formula states that production $p$ executed in environment $g; M$ starting in
the initial configuration $(s, i)$ returns a term $t$ and a final configuration $(s', i')$. A
dynamic environment $M$ contains local term variable bindings. A configuration
$(s, i)$ consists of the input stream $s$ and an integer counter $i$ to generate unique
fresh identifiers $x^i_B$ distinct from user-defined identifiers of the form $x_B$.

The parsing rules are given in figure 3. These rules involve syntactic objects
of the following categories:

$$
\begin{array}{rcll}
s & ::= & & \text{input streams} \\
  &      & * & \text{empty input stream} \\
  & \mid & x :: s & \text{identifier token} \\
b & ::= & & \text{terms} \\
  &      & \texttt{unit} & \text{trivial term} \\
  & \mid & x_{\mathit{Binder}} & \text{binder identifier} \\
  & \mid & x_{\mathit{Var}} & \text{variable identifier} \\
  & \mid & x_{\mathit{Label}} & \text{label identifier} \\
  & \mid & x^i_B & \text{fresh identifier of sort } B\ (i > 0),\ B \in \{\mathit{Binder}, \mathit{Var}, \mathit{Label}\} \\
  & \mid & c_{(B_1, \dots, B_k)B}(b_1, \dots, b_k) & \text{constructed term } (k \ge 0) \\
t & ::= & & \text{parse results} \\
  &      & b & \text{term} \\
  & \mid & \mathit{wrong} & \text{type error} \\
M & ::= & & \text{dynamic environments} \\
  &      & \emptyset & \text{empty environment} \\
  & \mid & M, x = b & \text{term binding}
\end{array}
$$

An input stream is a sequence of identifiers, some of which may have been
declared to be keywords in $g$ (e.g., "if", "("). We use the notation $K(g)$ to
denote the set of keywords defined in productions of $g$. The parsing rules for
terminals use $K(g)$ to distinguish between keywords and identifiers appearing
in the input stream.
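$$
g; M \vdash (s, i)\ \texttt{unit} \Rightarrow (s, i)\ \texttt{unit}
\qquad
g; M \vdash (x :: s, i)\ \texttt{"}x\texttt{"} \Rightarrow (s, i)\ \texttt{unit}
$$

$$
g; M \vdash (x :: s, i)\ \texttt{ide}(B) \Rightarrow (s, i)\ x_B
\qquad x \notin K(g),\ B \in \{\mathit{Binder}, \mathit{Var}, \mathit{Label}\}
$$

$$
g; M \vdash (s, i)\ \texttt{local} \Rightarrow (s, i + 1)\ x^i_{\mathit{Binder}}
\qquad
g; M \vdash (s, i)\ \texttt{global}(x) \Rightarrow (s, i)\ x_{\mathit{Var}}
$$

$$
g; M, x = t, M' \vdash (s, i)\ x \Rightarrow (s, i)\ t \quad x \notin \mathrm{Dom}(M')
\qquad
g; M \vdash (s, i)\ x \Rightarrow (s, i)\ \mathit{wrong} \quad x \notin \mathrm{Dom}(M)
$$

$$
\dfrac{g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ t \qquad t \ne \mathit{wrong} \qquad g; M \vdash (s', i')\ p_2 \Rightarrow (s'', i'')\ t'}
      {g; M \vdash (s, i)\ p_1\ p_2 \Rightarrow (s'', i'')\ t'}
\qquad
\dfrac{g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ \mathit{wrong}}
      {g; M \vdash (s, i)\ p_1\ p_2 \Rightarrow (s', i')\ \mathit{wrong}}
$$

$$
\dfrac{g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ t \qquad t \ne \mathit{wrong} \qquad g; M, x = t \vdash (s', i')\ p_2 \Rightarrow (s'', i'')\ t'}
      {g; M \vdash (s, i)\ x = p_1\ p_2 \Rightarrow (s'', i'')\ t'}
\qquad
\dfrac{g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ \mathit{wrong}}
      {g; M \vdash (s, i)\ x = p_1\ p_2 \Rightarrow (s', i')\ \mathit{wrong}}
$$

$$
\dfrac{g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ t}
      {g; M \vdash (s, i)\ p_1 \mid p_2 \Rightarrow (s', i')\ t}
\qquad
\dfrac{g; M \vdash (s, i)\ p_2 \Rightarrow (s', i')\ t}
      {g; M \vdash (s, i)\ p_1 \mid p_2 \Rightarrow (s', i')\ t}
$$

$$
\dfrac{g; M \vdash (s_{j-1}, i_{j-1})\ p_j \Rightarrow (s_j, i_j)\ t_j \quad (1 \le j \le k)}
      {g; M \vdash (s_0, i_0)\ c_{(B_1, \dots, B_k)B}(p_1, \dots, p_k) \Rightarrow (s_k, i_k)\ c_{(B_1, \dots, B_k)B}(t_1, \dots, t_k)}
$$

$$
\dfrac{\begin{array}{c}
g; M \vdash (s_{j-1}, i_{j-1})\ p_j \Rightarrow (s_j, i_j)\ t_j \quad (1 \le j \le k) \\
(x : (x_1{:}B_1, \dots, x_k{:}B_k)B \mathrel{\texttt{==}} p) \in g \\
g; \emptyset, x_1 = t_1, \dots, x_k = t_k \vdash (s_k, i_k)\ p \Rightarrow (s', i')\ t
\end{array}}
{g; M \vdash (s_0, i_0)\ x(p_1, \dots, p_k) \Rightarrow (s', i')\ t}
$$

$$
\dfrac{\begin{array}{c}
g; M \vdash (s_{j-1}, i_{j-1})\ p_j \Rightarrow (s_j, i_j)\ t_j \quad (1 \le j \le k) \\
(x : (x_1{:}B_1, \dots, x_k{:}B_k)B \mathrel{\texttt{==}} p) \notin g \\
g; \emptyset, x_1 = t_1, \dots, x_k = t_k \vdash (s_k, i_k)\ p \Rightarrow (s', i')\ t
\end{array}}
{g; M \vdash (s_0, i_0)\ x(p_1, \dots, p_k) \Rightarrow (s', i')\ \mathit{wrong}}
$$

Figure 3: Parse rules for terms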
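The parse rules can be read as a recursive interpreter over productions. The Python sketch below is a simplified illustration under several assumptions of our own: productions are tagged tuples, identifiers are pairs (name, sort), fresh identifiers are triples ("fresh", i, sort), the non-deterministic choice is approximated by trying the left alternative first and backtracking on failure, and an application of an undefined non-terminal yields wrong directly after its arguments have been parsed.

```python
WRONG = "wrong"

class ParseFailure(Exception):
    """No parse rule applies (e.g. an unexpected token)."""

def parse(g, keywords, M, s, i, p):
    """Return (s', i', t) for production p in environment g; M and configuration (s, i).

    g: dict non-terminal -> (parameter names, production); M: dict variable -> term;
    s: tuple of tokens; i: counter for fresh identifiers.
    """
    tag = p[0]
    if tag == "unit":
        return s, i, "unit"
    if tag == "kw":
        if s and s[0] == p[1]:
            return s[1:], i, "unit"
        raise ParseFailure(f"expected keyword {p[1]!r}")
    if tag == "ide":
        if s and s[0] not in keywords:
            return s[1:], i, (s[0], p[1])
        raise ParseFailure("expected identifier")
    if tag == "local":
        return s, i + 1, ("fresh", i, "Binder")
    if tag == "global":
        return s, i, (p[1], "Var")
    if tag == "var":
        return s, i, M.get(p[1], WRONG)
    if tag in ("seq", "bind"):
        x, p1, p2 = (None, p[1], p[2]) if tag == "seq" else p[1:]
        s1, i1, t = parse(g, keywords, M, s, i, p1)
        if t == WRONG:
            return s1, i1, WRONG
        M2 = M if x is None else {**M, x: t}
        return parse(g, keywords, M2, s1, i1, p2)
    if tag == "alt":
        try:
            return parse(g, keywords, M, s, i, p[1])
        except ParseFailure:
            return parse(g, keywords, M, s, i, p[2])
    if tag in ("cons", "app"):
        _, name, args = p
        ts = []
        for a in args:
            s, i, t = parse(g, keywords, M, s, i, a)
            ts.append(t)
        if tag == "cons":
            return s, i, (name, ts)
        if name not in g:
            return s, i, WRONG
        params, body = g[name]
        return parse(g, keywords, dict(zip(params, ts)), s, i, body)
    raise ValueError(f"unknown production {p!r}")

# Example: ident == x = ide(Var) mkTermVar(x); root production "(" t = ident ")" t.
g = {"ident": ([], ("bind", "x", ("ide", "Var"), ("cons", "mkTermVar", [("var", "x")])))}
p0 = ("seq", ("kw", "("), ("bind", "t", ("app", "ident", []), ("seq", ("kw", ")"), ("var", "t"))))
result = parse(g, {"(", ")"}, {}, ("(", "y", ")"), 1, p0)
```

In this example the whole input is consumed and the result term is the constructor mkTermVar applied to the variable identifier y.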
The sort of a term can be determined without reference to an environment:

$$
\texttt{unit} : \mathit{Unit}
\qquad
x_B : B
\qquad
x^i_B : B
\qquad
\dfrac{b_1 : B_1 \quad \dots \quad b_k : B_k}{c_{(B_1, \dots, B_k)B}(b_1, \dots, b_k) : B}
$$

A dynamic environment $M$ is said to match a static environment $L$ (written
as $M \models L$) if its term bindings have names and sorts compatible with the names
and sorts in $L$.

$$
\emptyset \models \emptyset
\qquad
\dfrac{M \models L \qquad b : B}{M, x = b \models L, x : B}
$$

The following theorem relates the dynamic parse rules in figure 3 with the 
static type rules presented in subsection 4.1. 

Theorem 1 (Parsing respects typing) If $g$, $E$, $L$, $p$, $M$, $s$, $s'$, $i$, and $i'$ are such
that

• $\emptyset \vdash g :: E$

• $\emptyset \vdash g\ \text{ok}$

• $E; L \vdash p : B$

• $M \models L$

• $g; M \vdash (s, i)\ p \Rightarrow (s', i')\ t$

then $t : B$.



The proof of this theorem can be found in the appendix. In particular, if a
non-parameterized ($L = M = \emptyset$) parser with result sort $B$ for a root production $p_0$
defined in a type-correct grammar $g$ consumes the full input stream $s$ (returning
the empty input stream $*$), the parse result $t$ is guaranteed to be of sort $B$:

Corollary 1 If

• $\emptyset \vdash g :: E$,

• $\emptyset \vdash g\ \text{ok}$,

• $E; \emptyset \vdash p_0 : B$, and

• $g; \emptyset \vdash (s, 1)\ p_0 \Rightarrow (*, i')\ t$

then $t : B$ and $t \ne \mathit{wrong}$.
It should be noted that the parse rules in figure 3 are non-deterministic due
to the rules given for the choice operator $p_1 \mid p_2$. In the actual implementation,
each choice operator in a grammar $g$ is replaced (at grammar-definition time)
by a choice construct of the form $T_1 : p_1 \mid T_2 : p_2$. The set $T_i \subseteq K(g) \cup \{\texttt{ide}\}$
of possible start tokens for phrases accepted by $p_i$ is called the director set for
$p_i$. The refined parse rules perform a deterministic choice based on the current
input token:

$$
\mathrm{Token}(x, g) =
\begin{cases}
\texttt{"}x\texttt{"} & \text{if } \texttt{"}x\texttt{"} \in K(g) \\
\texttt{ide} & \text{otherwise}
\end{cases}
$$

$$
\dfrac{g; M \vdash (x :: s, i)\ p_1 \Rightarrow (s', i')\ t \qquad \mathrm{Token}(x, g) \in T_1}
      {g; M \vdash (x :: s, i)\ T_1 : p_1 \mid T_2 : p_2 \Rightarrow (s', i')\ t}
$$

$$
\dfrac{g; M \vdash (x :: s, i)\ p_2 \Rightarrow (s', i')\ t \qquad \mathrm{Token}(x, g) \in T_2}
      {g; M \vdash (x :: s, i)\ T_1 : p_1 \mid T_2 : p_2 \Rightarrow (s', i')\ t}
$$

The computation of the director sets is accomplished by standard algorithms
developed for non-incremental LL(1) parsers in time linear in the size of the
grammar [WG85]. A grammar is rejected as ambiguous if it contains a
production $T_1 : p_1 \mid T_2 : p_2$ where $T_1 \cap T_2 \ne \{\}$.
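The deterministic choice can be illustrated with a small Python sketch. It assumes that the director sets have already been computed (their computation is not shown) and encodes tokens as tuples, ("kw", x) for a keyword and ("ide",) for the identifier class. The function names are chosen for this example.

```python
IDE = ("ide",)

def token(x, keywords):
    """Token(x, g): the keyword itself if x is a keyword of g, otherwise the class ide."""
    return ("kw", x) if x in keywords else IDE

def check_unambiguous(T1, T2):
    """Reject a choice T1 : p1 | T2 : p2 whose director sets overlap."""
    overlap = set(T1) & set(T2)
    if overlap:
        raise ValueError(f"ambiguous choice, shared start tokens: {sorted(overlap)}")

def choose(x, keywords, alternatives):
    """Select the alternative whose director set contains Token(x, g).

    alternatives is a list of (director set, production) pairs with pairwise
    disjoint director sets; None is returned if no alternative applies.
    """
    tok = token(x, keywords)
    for director_set, production in alternatives:
        if tok in director_set:
            return production
    return None
```

Because the director sets are disjoint, at most one alternative can be selected for any input token, so no backtracking is required.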



4.3 Pattern-based Production Generation 

In the previous subsection we did not consider the parsing of productions de- 
fined by means of patterns enclosed in <<>>. In this subsection we describe a 
translation of such productions into simpler ones, so that they are covered by 
the parse rules and the theorem given in the previous subsection. 

A pattern x<<s>> is a pair of a token stream $s$ and a non-terminal $x$ that
defines which production is to be used to parse $s$. The result of parsing $s$ is itself
a production $p$ that neither contains terminal productions that depend on input
tokens nor choice operators ($p_1 \mid p_2$). If $p$ is later executed within an environment
that defines bindings for the pattern variables occurring in $s$, then $p$ performs
the necessary steps to instantiate correctly (macro-expand) the pattern. These
steps include defining term bindings, performing term constructor applications,
and introducing fresh variables, where necessary.

We describe the effect of parsing a pattern p<<s>> with the notation

$$
g; L; R \vdash (s)\ p \Rightarrow (s')\ p'
$$

This formula states that production $p$ when executed in environment $g; L; R$
starting with a token stream $s$ returns a token stream $s'$ (the unread tokens of
$s$) and a constructed production $p'$.

The environment $L$ contains the names and sorts of the pattern variables
bound in the scope enclosing the pattern. For example, $L$ contains
$\emptyset, x : \mathit{Binder}, a : \mathit{Term}, b : \mathit{Term}$ for the pattern term<<...>> in the following
grammar:
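$$
g; L; R \vdash (s)\ \texttt{unit} \Rightarrow (s)\ \texttt{unit}
\qquad
g; L; R \vdash (x :: s)\ \texttt{"}x\texttt{"} \Rightarrow (s)\ \texttt{unit}
$$

$$
g; L; R \vdash (x :: s)\ \texttt{ide}(\mathit{Var}) \Rightarrow (s)\ \texttt{global}(x)
\qquad x \notin K(g),\ x \notin \mathrm{Dom}(L)
$$

$$
g; L, x : B, L'; R \vdash (x :: s)\ \texttt{ide}(B) \Rightarrow (s)\ x
\qquad x \notin K(g),\ x \notin \mathrm{Dom}(L')
$$

$$
g; L; R \vdash (s)\ \texttt{local} \Rightarrow (s)\ \texttt{local}
\qquad
g; L; R \vdash (s)\ \texttt{global}(x) \Rightarrow (s)\ \texttt{global}(x)
$$

$$
g; L; R, x \leftarrow x', R' \vdash (s)\ x \Rightarrow (s)\ x'
\qquad x \notin \mathrm{Dom}(R')
$$

$$
\dfrac{g; L; R \vdash (s)\ p_1 \Rightarrow (s')\ p'_1 \qquad g; L; R \vdash (s')\ p_2 \Rightarrow (s'')\ p'_2}
      {g; L; R \vdash (s)\ p_1\ p_2 \Rightarrow (s'')\ p'_1\ p'_2}
$$

$$
\dfrac{g; L; R \vdash (s)\ p_1 \Rightarrow (s')\ p'_1 \qquad g; L; R, x \leftarrow x' \vdash (s')\ p_2 \Rightarrow (s'')\ p'_2 \qquad x' \notin \mathrm{Dom}(L) \cup \mathrm{Ran}(R)}
      {g; L; R \vdash (s)\ x = p_1\ p_2 \Rightarrow (s'')\ x' = p'_1\ p'_2}
$$

$$
\dfrac{g; L; R \vdash (s)\ p_1 \Rightarrow (s')\ p'_1}
      {g; L; R \vdash (s)\ p_1 \mid p_2 \Rightarrow (s')\ p'_1}
\qquad
\dfrac{g; L; R \vdash (s)\ p_2 \Rightarrow (s')\ p'_2}
      {g; L; R \vdash (s)\ p_1 \mid p_2 \Rightarrow (s')\ p'_2}
$$

$$
\dfrac{g; L; R \vdash (s_{j-1})\ p_j \Rightarrow (s_j)\ p'_j \quad (1 \le j \le k)}
      {g; L; R \vdash (s_0)\ c_{(B_1, \dots, B_k)B}(p_1, \dots, p_k) \Rightarrow (s_k)\ c_{(B_1, \dots, B_k)B}(p'_1, \dots, p'_k)}
$$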

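$$
\dfrac{\begin{array}{c}
g; L; R \vdash (s_{j-1})\ p_j \Rightarrow (s_j)\ p'_j \quad (1 \le j \le k) \\
R' = \emptyset, x_1 \leftarrow x'_1, \dots, x_k \leftarrow x'_k \qquad x'_i \notin \mathrm{Dom}(L) \cup \mathrm{Ran}(R) \qquad x'_i \ne x'_j \text{ for } i \ne j \\
g; L; R' \vdash (s_k)\ p \Rightarrow (s')\ p' \qquad (x : (x_1{:}B_1, \dots, x_k{:}B_k)B \mathrel{\texttt{==}} p) \in g
\end{array}}
{g; L; R \vdash (s_0)\ x(p_1, \dots, p_k) \Rightarrow (s')\ x'_1 = p'_1\ \dots\ x'_k = p'_k\ p'}
$$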



Figure 4: Parse rules for patterns in incremental grammar definitions 
grammar

simpleTerm:Term |==
"let" x=ide "=" a=term "in" b=term => term<<(fun(x) b)(a)>>
end

The notation $g; L; R \vdash (s)\ p \Rightarrow (s')\ p'$ uses a set $R$ of renamings to avoid name
conflicts between the pattern variables in $L$ and the term variables introduced
during pattern parsing:

$$
\begin{array}{rcll}
R & ::= & \emptyset & \text{empty renaming} \\
  & \mid & R, x \leftarrow x' & \text{rename } x \text{ by } x'
\end{array}
$$

When $R = \emptyset, x_1 \leftarrow x'_1, \dots, x_n \leftarrow x'_n$, we write $\mathrm{Ran}(R)$ for the set $\{x'_1, \dots, x'_n\}$.

The complete set of rules for parsing patterns is given in figure 4. As
an example, here is the production $p'$ that results from parsing the pattern
term<<(fun(x) b)(a)>> in the environment $L = \emptyset, x : \mathit{Binder}, a : \mathit{Term}, b : \mathit{Term}$
using the grammar defined in figure 2:

    a1=(x1=x
        a2=(a3=(a4=b
                a4)
            b2=(a3)
            b2)
        mkTermFun(x1 a2))
    b1=(b3=(a5=(a6=a
                a6)
            b4=(a5)
            b4)
        a4=mkTermApp(a1 b3)
        a4)
    b1

In this example, we use subscripted identifiers for fresh term variable identifiers 
introduced during the translation process. Furthermore, we use brackets and 
indentation to indicate the scope of these variable identifiers. By removing 
redundant intermediate bindings, the generated production can be simplified to 
mkTermApp(mkTermFun(x b) a), as expected. 
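The simplification step can be reproduced mechanically. The following Python sketch removes intermediate bindings by substitution; it assumes that bound productions have no side effects (they contain only variables and constructor applications, as in the example) and that all introduced names are fresh. The classes Var, Bind, and Cons are an encoding chosen for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Bind:  # x = p1 p2
    name: str
    p1: object
    p2: object

@dataclass(frozen=True)
class Cons:  # c(p1 ... pk)
    name: str
    args: tuple

def simplify(p, env=None):
    """Inline every binding x = p1 p2 by substituting p1 for x in p2."""
    env = env or {}
    if isinstance(p, Var):
        return env.get(p.name, p)  # pattern variables such as x, a, b stay free
    if isinstance(p, Bind):
        value = simplify(p.p1, env)
        return simplify(p.p2, {**env, p.name: value})
    if isinstance(p, Cons):
        return Cons(p.name, tuple(simplify(a, env) for a in p.args))
    raise TypeError(f"unexpected production {p!r}")

def show(p):
    if isinstance(p, Var):
        return p.name
    return f"{p.name}({' '.join(show(a) for a in p.args)})"

# The generated production from the example above.
a1 = Bind("x1", Var("x"),
          Bind("a2", Bind("a3", Bind("a4", Var("b"), Var("a4")),
                          Bind("b2", Var("a3"), Var("b2"))),
               Cons("mkTermFun", (Var("x1"), Var("a2")))))
b1 = Bind("b3", Bind("a5", Bind("a6", Var("a"), Var("a6")),
                     Bind("b4", Var("a5"), Var("b4"))),
          Bind("a4", Cons("mkTermApp", (Var("a1"), Var("b3"))), Var("a4")))
generated = Bind("a1", a1, Bind("b1", b1, Var("b1")))
```

Here show(simplify(generated)) returns the string mkTermApp(mkTermFun(x b) a).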

The following theorem states that the successful parsing of a pattern p<<s>>
using a production $p$ of a type-correct grammar $g$ yields a well-typed production.

Theorem 2 If $g$, $E$, $L_1$, $L_2$, $L_3$, $p$, $B$, and $R$ are such that

• $\emptyset \vdash g :: E$

• $\emptyset \vdash g\ \text{ok}$

• $E; L_3 \vdash p : B$ where $L_3 = \emptyset, x_1 : B_1, \dots, x_n : B_n$

• $R = \emptyset, x_1 \leftarrow x'_1, \dots, x_n \leftarrow x'_n$ with $x'_i \ne x'_j$ for $i \ne j$ and
$\mathrm{Ran}(R) \cap \mathrm{Dom}(L_1) = \emptyset$

• $L_2 = \emptyset, x'_1 : B_1, \dots, x'_n : B_n$

• $g; L_1; R \vdash (s)\ p \Rightarrow (s')\ p'$

then $E; L_1, L_2 \vdash p' : B$.

The proof can be found in the appendix. By specializing this theorem to a
non-parameterized production $p_0$ with result sort $B$, we obtain that a pattern
p0<<s>> in an arbitrary local environment $L$ can be translated into an equivalent
production $p'$ that also has the result sort $B$:

Corollary 2 If

• $\emptyset \vdash g :: E$

• $\emptyset \vdash g\ \text{ok}$

• $E; \emptyset \vdash p_0 : B$

• $g; L; \emptyset \vdash (s)\ p_0 \Rightarrow (*)\ p'$

then $E; L, \emptyset \vdash p' : B$.

5 An Extensible Parser Package 

Extensible grammars as described in this paper were developed in the context of
the Tycoon database programming environment [Mat93]. However, as sketched
in figure 1, the extensible grammar package was implemented in a way that
factors out all target-language dependencies (the base sorts $B_i$, the abstract
syntax tree constructors $c_{(B_1, \dots, B_k)B}$, and the renaming operation on abstract
syntax trees) from the package implementation.

A token stream s is represented as an object with a local state and methods 
to inspect the current input token and to advance to the next input token. 

A parser for terms of a sort B is represented as a function that takes a 
scanner object and returns a typed abstract syntax tree; the function modifies 
the state of the scanner object and a variable counter used for generating fresh 
variable identifiers. 

A grammar $g_i$ is represented as an object of an abstract data type
encapsulating information about the target language $TL$ and the object language $OL_i$
accepted by $g_i$. The implementor of a compiler for a language with an extensible
grammar links the parser package into the compiler. A grammar for the target
language at hand is generated via calls to the parser interface. Finally, a parser
for this grammar is generated, and it is used to parse actual program input.

The following steps have to be taken to generate the grammar $g_0$ and a
parser for the initial object language $OL_0$. Each of these steps is implemented
by a function call to the parser package that passes the grammar as an explicit
argument.
1. Creation of an initial (empty) grammar $g_0$. Arguments to this operation
provide information about the tokens returned by the scanner, and
functions for creating fresh identifiers. An initial grammar already contains
the names of the built-in sorts Label, Var, and Binder.

2. Addition of named sorts to $g_0$. These sorts correspond to
abstract-syntax-tree types in the target-language compiler. For each newly defined sort, an
AST copy routine, an AST renaming routine, and a distinguished error
value have to be supplied. The error value is generated by the parser
package in case of parse errors.

3. Addition of named constructors to $g_0$. Constructors correspond to
functions in the target-language compiler that take k > 0 typed abstract
syntax trees and return an aggregated syntax tree. For each constructor, the
list of its argument sorts and its result sort have to be specified.

4. Addition of a concrete syntax for grammar definitions to $g_0$.
Target-language implementors can either adopt the concrete syntax used in this
paper (grammar ...end), or define their own tailored syntax for the
definition of productions $p$ that match the abstract syntax given in
subsection 4.1.

5. Generation of a parser for $g_0$. Parser generation involves calculating
director sets to support efficient LL(1) parsing. Furthermore, variable and
non-terminal references are resolved into direct table indices.

6. Parsing of a grammar extension $g$ using the parser generated in the
previous step. The grammar extension $g$ defines the mapping from $OL_0$ terms
to $TL$ terms.

7. Extension of $g_0$ by $g$.

8. Generation of a parser for the extended $g_0$.
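Steps 1 to 3 can be pictured with a small Python sketch. The class and method names below are hypothetical and do not reproduce the interface of the actual parser package; the sketch only shows the bookkeeping that these steps imply, namely that sorts must be registered with their routines and error value before constructors can refer to them.

```python
class Grammar:
    """Hypothetical stand-in for the grammar object passed to the parser package."""

    def __init__(self):
        # Step 1: an initial grammar already knows the built-in sorts.
        self.sorts = {"Label": None, "Var": None, "Binder": None}
        self.constructors = {}  # name -> (argument sorts, result sort)

    def add_sort(self, name, copy, rename, error_value):
        # Step 2: each new sort comes with a copy routine, a renaming routine,
        # and a distinguished error value.
        if name in self.sorts:
            raise ValueError(f"sort {name!r} already defined")
        self.sorts[name] = (copy, rename, error_value)

    def add_constructor(self, name, arg_sorts, result):
        # Step 3: argument and result sorts must have been declared before.
        unknown = [b for b in (*arg_sorts, result) if b not in self.sorts]
        if unknown:
            raise ValueError(f"unknown sorts {unknown}")
        self.constructors[name] = (tuple(arg_sorts), result)

g0 = Grammar()
g0.add_sort("Term", copy=lambda t: t, rename=lambda t, r: t, error_value="<error>")
g0.add_constructor("mkTermVar", ["Var"], "Term")
g0.add_constructor("mkTermFun", ["Binder", "Term"], "Term")
```

The argument sorts given for mkTermVar and mkTermFun are assumptions made for the example.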

A parser for $OL_i$ derived from a grammar $g_i$ returns either a term of the
target language proper, or an abstract syntax tree for an incremental syntax
extension $g_\Delta$. In the latter case, the parser package is invoked to check the
type correctness of $g_\Delta$ in the scope of the environment $E_i$ established by the
current grammar $g_i$. If this check succeeds, the extended grammar is obtained
by normalizing the grammar sequence $g_i, g_\Delta \leadsto g_{i+1}$. Finally, a new parser is
generated for $g_{i+1}$; this parser can then be used to parse further input in the
extended language $OL_{i+1}$.

If the parsing result is a term t of the target language, the parser package 
also returns a list of variable renamings. These renamings have to be performed 
by the target-language compiler in t to establish bindings to global variable 
identifiers (see subsection 3.3). 
6 Comparison with Related Work



Extensibility has been studied previously in the context of programming lan- 
guages and theorem provers [Dow90]. In the early work on language extensi- 
bility [Gal74, Sta75], both syntax and semantics could be modified arbitrarily, 
sometimes with disastrous effects [Chr90]. Traditional macro facilities allow only
syntax extensions. We have already discussed some of the defects of macros. 
Several recent works propose improvements on macros. 

Linguistic reflection [SMM91, SSS+92, SSF92, Kir92] in persistent programming
languages has been used to add high-level (query) notations to strongly-typed
programming languages. These extensions are achieved by executing
user-defined code at compile time; this code transforms syntax trees returned 
from the parser prior to further processing by the type checker and code gen- 
erator. Our approach differs from this work since we are able to guarantee the 
termination of compilation, even when our transformation operations are defined 
recursively. Furthermore, we are not aware of work in the context of linguis- 
tic reflection to handle correctly the problematic binding situations sketched in 
subsection 3.3. 

Some non-persistent language implementations, like CAML and SML, inte- 
grate YACC or a similar parser generator that allows them to introduce new 
syntax [MR92]. If the new syntax is to be mixed with the old one, the new 
syntax must be quoted in some way. Instead, we can freely intermix new and 
old syntax without special quotations; it is also possible to remove existing 
keywords by redefining non-terminals with the :== operator. 

Hygienic macros [KFFD92, K0I186] have goals similar to those of our ex- 
tensible grammars; these macros also work on the abstract syntax and avoid 
binding anomalies. However, these macros account only for explicit (parame- 
terized) macro calls and not for more liberal keyword-based syntax extensions. 
Hygienic macros employ a multi-pass time-stamping algorithm to prevent vari- 
able capture; this algorithm is different from our one-pass renaming algorithm. 
Furthermore, we do not handle quotation and antiquotation in the style of Lisp. 

Griffin [Gri88] has enumerated desirable properties of notational definitions 
and has studied their formalization. Unlike Griffin, who translates notations to 
combinator form, we are able to handle variables bound to non-local binders 
in patterns. Moreover, while Griffin discusses abstract translations, we pro- 
vide a specific grammar definition technique and an efficient parsing algorithm. 
Parsing is efficient because it is LL(1) and because it avoids the creation of 
intermediate parse trees, producing abstract syntax trees that do not require 
normalization. 

Bove and Arbilla [BA92] discuss how to use explicit substitutions to imple- 
ment syntax extensions. Theirs is an elegant idea that may be exploited in 
systems where the target compiler supports explicit substitutions. As in the 
previous case, their work does not describe a parsing algorithm, but presents an 
interesting theory. 
Traditionally, the most sophisticated macro-definition facilities have been
developed for Lisp-like languages; the regular syntactic structure of Lisp simpli- 
fies program manipulation. Recent work has extended AST macro manipulation 
to syntactically complex languages. For example, Weise and Crew use a full C 
language extended with patterns as a preprocessor for the C language [WC93]; 
their macros have syntactic types (our sorts) that guarantee the generation of 
well-formed AST's. We have achieved considerable flexibility in the manipula- 
tion of complex languages, but without resorting to a computationally complete 
macro language. This way, we can guarantee termination of the parsing phase. 

7 Conclusion 

Extensible grammars avoid many of the problems associated with traditional 
tools for macro expansion and program rewriting, by enforcing sort constraints 
at grammar-definition time and by respecting lexical scoping. Furthermore, 
since extensible parsers introduce only a small set of new concepts, they can be 
integrated with little overhead in current compilation environments. 

Traditional database programming languages have a bias towards a specific 
data model by providing built-in syntactic support tailored to the structures 
and operations of that data model. In a programming environment equipped 
with extensible grammars, such syntactic forms can be eliminated from the core 
language definition and can be introduced in shared application libraries. 
Appendix



This appendix contains the proofs of our theorems. 
Proof of Theorem 1 

The proof is carried out by induction on the parsing derivations with the rules 
in figure 3. We treat the rules one by one: 

• $g; M \vdash (s, i)\ \texttt{unit} \Rightarrow (s, i)\ \texttt{unit}$

Suppose $E; L \vdash \texttt{unit} : B$. Then $B = \mathit{Unit}$ according to the type rules for
productions. Moreover, by definition $\texttt{unit} : \mathit{Unit}$.

• $g; M \vdash (x :: s, i)\ \texttt{"}x\texttt{"} \Rightarrow (s, i)\ \texttt{unit}$

Suppose $E; L \vdash \texttt{"}x\texttt{"} : B$. Then $B = \mathit{Unit}$ according to the type rules.
Again, $\texttt{unit} : \mathit{Unit}$.

• $g; M \vdash (x :: s, i)\ \texttt{ide}(B) \Rightarrow (s, i)\ x_B$ where $x \notin K(g)$ and
$B \in \{\mathit{Binder}, \mathit{Var}, \mathit{Label}\}$

Suppose $E; L \vdash \texttt{ide}(B) : B'$. According to the type rules, $B'$ can be only
$B$ and matches the type of the concrete term $x_B : B$.

• $g; M \vdash (s, i)\ \texttt{local} \Rightarrow (s, i + 1)\ x^i_{\mathit{Binder}}$

Suppose $E; L \vdash \texttt{local} : B$. According to the type rules, $B$ has to be
$\mathit{Binder}$. Moreover, the term $x^i_{\mathit{Binder}}$ has type $\mathit{Binder}$.

• $g; M \vdash (s, i)\ \texttt{global}(x) \Rightarrow (s, i)\ x_{\mathit{Var}}$

Suppose $E; L \vdash \texttt{global}(x) : B$. Sort $B$ has to be $\mathit{Var}$. Moreover,
$x_{\mathit{Var}} : \mathit{Var}$.

• $g; M, x = t, M' \vdash (s, i)\ x \Rightarrow (s, i)\ t$ where $x \notin \mathrm{Dom}(M')$

Suppose $E; L \vdash x : B$. According to the type rules, $L$ has to be of the
form $L', x : B, L''$ such that $x \notin \mathrm{Dom}(L'')$. Since $M, x = t, M' \models L$, $t : B$.

• $g; M \vdash (s, i)\ x \Rightarrow (s, i)\ \mathit{wrong}$ where $x \notin \mathrm{Dom}(M)$

Suppose $E; L \vdash x : B$ to obtain a contradiction. According to the type
rules, $x \in \mathrm{Dom}(L)$. However, since $M \models L$, this implies $x \in \mathrm{Dom}(M)$,
contradicting the side condition of the parsing rule.

• $g; M \vdash (s, i)\ p_1\ p_2 \Rightarrow (s'', i'')\ t'$ where $t' \ne \mathit{wrong}$

According to the parse rules there has to be a $t \ne \mathit{wrong}$ such that
$g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ t$ and $g; M \vdash (s', i')\ p_2 \Rightarrow (s'', i'')\ t'$. Moreover,
suppose that $E; L \vdash p_1\ p_2 : B'$. According to the type rules $E; L \vdash p_2 : B'$.
Applying the induction hypothesis we obtain $t' : B'$.
• $g; M \vdash (s, i)\ p_1\ p_2 \Rightarrow (s'', i'')\ \mathit{wrong}$ and $g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ \mathit{wrong}$

Suppose $E; L \vdash p_1\ p_2 : B'$ to obtain a contradiction. By the type rules
$E; L \vdash p_1 : B$. However, applying the induction hypothesis, this
contradicts the assumptions since there is no $B$ such that $\mathit{wrong} : B$.

• $g; M \vdash (s, i)\ p_1\ p_2 \Rightarrow (s'', i'')\ \mathit{wrong}$ and $g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ t$,
$t \ne \mathit{wrong}$

Suppose $E; L \vdash p_1\ p_2 : B'$ to obtain a contradiction. By the type
rules $E; L \vdash p_2 : B'$. According to the parse rules $g; M \vdash (s', i')\ p_2 \Rightarrow
(s'', i'')\ \mathit{wrong}$. Applying the induction hypothesis leads to the false
statement $\mathit{wrong} : B'$.

• $g; M \vdash (s, i)\ x = p_1\ p_2 \Rightarrow (s'', i'')\ t'$ where $t' \ne \mathit{wrong}$

Suppose $E; L \vdash x = p_1\ p_2 : B'$. According to the type rules, it must be
that, for some $B$, $E; L \vdash p_1 : B$ and $E; L, x : B \vdash p_2 : B'$. According to
the parse rules $g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ t$ and $g; M, x = t \vdash (s', i')\ p_2 \Rightarrow
(s'', i'')\ t'$. By induction hypothesis $t : B$. Hence $M, x = t \models L, x : B$ and
by applying the induction hypothesis again one establishes that $t' : B'$.

• $g; M \vdash (s, i)\ x = p_1\ p_2 \Rightarrow (s'', i'')\ \mathit{wrong}$ and
$g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ \mathit{wrong}$

Suppose $E; L \vdash x = p_1\ p_2 : B'$ to obtain a contradiction. According to
the type rules, it must be that, for some $B$, $E; L \vdash p_1 : B$. Applying the
induction hypothesis, this leads to a contradiction since there is no $B$ such
that $\mathit{wrong} : B$.

• $g; M \vdash (s, i)\ x = p_1\ p_2 \Rightarrow (s'', i'')\ \mathit{wrong}$ and
$g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ t$, $t \ne \mathit{wrong}$

Suppose $E; L \vdash x = p_1\ p_2 : B'$ to obtain a contradiction. According to the
type rules, it must be that, for some $B$, $E; L, x : B \vdash p_2 : B'$. Applying
the induction hypothesis, this leads to a contradiction since there is no $B'$
such that $\mathit{wrong} : B'$.

• $g; M \vdash (s, i)\ p_1 \mid p_2 \Rightarrow (s', i')\ t$ and $g; M \vdash (s, i)\ p_1 \Rightarrow (s', i')\ t$

Suppose $E; L \vdash p_1 \mid p_2 : B$. According to the type rules this implies
$E; L \vdash p_1 : B$. Applying the induction hypothesis to the derivation for $p_1$
establishes $t : B$.

• $g; M \vdash (s, i)\ p_1 \mid p_2 \Rightarrow (s', i')\ t$ and $g; M \vdash (s, i)\ p_2 \Rightarrow (s', i')\ t$

Suppose $E; L \vdash p_1 \mid p_2 : B$. According to the type rules this implies
$E; L \vdash p_2 : B$. Applying the induction hypothesis to the derivation for $p_2$
establishes $t : B$.
• $g; M \vdash (s_0, i_0)\ c_{(B_1, \dots, B_k)B}(p_1, \dots, p_k) \Rightarrow (s_k, i_k)\ t$ where $t \ne \mathit{wrong}$

Suppose $E; L \vdash c_{(B_1, \dots, B_k)B}(p_1, \dots, p_k) : B'$. The type rules imply $B' =
B \ne \mathit{wrong}$ and $E; L \vdash p_j : B_j$ for $1 \le j \le k$. Moreover, the parse
rules guarantee that $t$ is of the form $c_{(B_1, \dots, B_k)B}(t_1, \dots, t_k)$ with $g; M \vdash
(s_{j-1}, i_{j-1})\ p_j \Rightarrow (s_j, i_j)\ t_j$ for $1 \le j \le k$. Applying the induction
hypothesis to the derivations for $p_1$ through $p_k$ establishes $t_1 : B_1, \dots, t_k :
B_k$. Using the definition for the types of base terms, we obtain $t =
c_{(B_1, \dots, B_k)B}(t_1, \dots, t_k) : B$ and this case is settled.

• $g; M \vdash (s_0, i_0)\ x(p_1, \dots, p_k) \Rightarrow (s', i')\ t$ where $t \ne \mathit{wrong}$

Suppose $E; L \vdash x(p_1, \dots, p_k) : B$. The type rules assert that there exist
(1) $B_i$ such that $E; L \vdash p_i : B_i$ (for $1 \le i \le k$) and (2) a non-terminal
$x : S \in E$ such that $S = (x_1{:}B_1, \dots, x_k{:}B_k)B$. Applying the induction
hypothesis to (1) and $g; M \vdash (s_{j-1}, i_{j-1})\ p_j \Rightarrow (s_j, i_j)\ t_j$, we have $t_j : B_j$.

Suppose $\emptyset \vdash g :: E$. This implies that $x$ as defined in (2) is unique in $g$.
Furthermore, suppose that $\emptyset \vdash g\ \text{ok}$. This implies that (3) $E; \emptyset, x_1 :
B_1, \dots, x_k : B_k \vdash p : B$. Note that $M' = \emptyset, x_1 = t_1, \dots, x_k = t_k \models
\emptyset, x_1 : B_1, \dots, x_k : B_k$. Applying the induction hypothesis to (3) and
$g; M' \vdash (s_k, i_k)\ p \Rightarrow (s', i')\ t$, we finally have $t : B$.

• $g; M \vdash (s_0, i_0)\ x(p_1, \dots, p_k) \Rightarrow (s', i')\ \mathit{wrong}$ and
$(x : (x_1{:}B_1, \dots, x_k{:}B_k)B \mathrel{\texttt{==}} p) \notin g$

Suppose $E; L \vdash x(p_1, \dots, p_k) : B$ to obtain a contradiction. The type
rules assert that there exists a non-terminal $x : S \in E$ such that $S =
(x_1{:}B_1, \dots, x_k{:}B_k)B$. Furthermore, suppose that $\emptyset \vdash g :: E$. This implies
together with $x : S \in E$ that there is a non-terminal definition $x : S \mathrel{\texttt{==}} p \in
g$ contradicting our initial assumption about the derivation. □

Proof of Theorem 2 

The proof is performed by induction on the parsing derivations for patterns with 
the rules in figure 4. We treat each rule in turn: 

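• $g; L_1; R \vdash (s)\ \texttt{unit} \Rightarrow (s)\ \texttt{unit}$

Suppose $E; L_3 \vdash \texttt{unit} : B$. Then $B = \mathit{Unit}$ and $E; L_1, L_2 \vdash \texttt{unit} : B$.

• $g; L_1; R \vdash (x :: s)\ \texttt{"}x\texttt{"} \Rightarrow (s)\ \texttt{unit}$

Suppose $E; L_3 \vdash \texttt{"}x\texttt{"} : B$. Then $B = \mathit{Unit}$ and $E; L_1, L_2 \vdash \texttt{unit} : B$.

• $g; L_1; R \vdash (x :: s)\ \texttt{ide}(\mathit{Var}) \Rightarrow (s)\ \texttt{global}(x)$ and $x \notin K(g)$, $x \notin \mathrm{Dom}(L_1)$

Suppose $E; L_3 \vdash \texttt{ide}(\mathit{Var}) : B$. Then $B = \mathit{Var}$ and $E; L_1, L_2 \vdash \texttt{global}(x) :
\mathit{Var}$.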
• $g; L, x : B, L'; R \vdash (x :: s)\, \mathrm{ide}(B) \Rightarrow (s)\, x$ and $x \notin K(g)$, $x \notin \mathrm{Dom}(L')$

Suppose $R = \emptyset, x_1 \leftarrow x'_1, \ldots, x_n \leftarrow x'_n$,
$\mathrm{Ran}(R) \cap \mathrm{Dom}(L, x : B, L') = \{\}$,
$L_3 = \emptyset, x_1 : B_1, \ldots, x_n : B_n$, and $L_2 = \emptyset, x'_1 : B_1, \ldots, x'_n : B_n$.
Furthermore, suppose $E; L_3 \vdash \mathrm{ide}(B) : B$. Since $x \notin \mathrm{Ran}(R) = \mathrm{Dom}(L_2)$,
it follows that $E; L, x : B, L', L_2 \vdash x : B$.

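• $g; L_1; R \vdash (s)\, \mathrm{local} \Rightarrow (s)\, \mathrm{local}$

Suppose $E; L_3 \vdash \mathrm{local} : B$. Then $B = \mathrm{Binder}$ and
$E; L_1, L_2 \vdash \mathrm{local} : \mathrm{Binder}$.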

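• $g; L_1; R \vdash (s)\, \mathrm{global}(x) \Rightarrow (s)\, \mathrm{global}(x)$

Suppose $E; L_3 \vdash \mathrm{global}(x) : B$. Then $B = \mathrm{Var}$ and $E; L_1, L_2 \vdash \mathrm{global}(x) : \mathrm{Var}$.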

• $g; L_1; R, x \leftarrow x', R' \vdash (s)\, x \Rightarrow (s)\, x'$ and $x \notin \mathrm{Dom}(R')$

Suppose $L_3 = L'_3, x : B, L''_3$ and $E; L_3 \vdash x : B$ and $L_2 = L'_2, x' : B, L''_2$.
Since $x'_i \neq x'_j$ in […], it follows that $x' \notin \mathrm{Dom}(R'')$. Hence
$E; L_1, L'_2, x' : B, L''_2 \vdash x' : B$.

• $g; L_1; R \vdash (s)\, p_1\ p_2 \Rightarrow (s'')\, p'_1\ p'_2$

Suppose $E; L_3 \vdash p_1\ p_2 : B'$, that is, $E; L_3 \vdash p_1 : B$ and $E; L_3 \vdash p_2 : B'$.
We know that $g; L_1; R \vdash (s)\, p_1 \Rightarrow (s')\, p'_1$ and $g; L_1; R \vdash (s')\, p_2 \Rightarrow (s'')\, p'_2$.
Applying the induction hypothesis gives $E; L_1, L_2 \vdash p'_1 : B$ and
$E; L_1, L_2 \vdash p'_2 : B'$. Hence, via the type rules, $E; L_1, L_2 \vdash p'_1\ p'_2 : B'$.

• $g; L_1; R \vdash (s)\, x = p_1\ p_2 \Rightarrow (s'')\, x' = p'_1\ p'_2$

Suppose $E; L_3 \vdash x = p_1\ p_2 : B'$, that is, (1) $E; L_3 \vdash p_1 : B$ and
(2) $E; L_3, x : B \vdash p_2 : B'$. Using the induction hypothesis, (1) and
$g; L_1; R \vdash (s)\, p_1 \Rightarrow (s')\, p'_1$ establish (3) $E; L_1, L_2 \vdash p'_1 : B$. Since the
environments $R' = R, x \leftarrow x'$ and $L'_3 = L_3, x : B$ and $L'_2 = L_2, x' : B$
satisfy $\mathrm{Ran}(R) \cap L_1 = \{\}$ and $x' \notin \mathrm{Dom}(L_1) \cup \mathrm{Ran}(R)$, we can apply
the induction hypothesis to $g; L_1; R, x \leftarrow x' \vdash (s')\, p_2 \Rightarrow (s'')\, p'_2$ and (2),
giving $E; L_1, L_2, x' : B \vdash p'_2 : B'$. Using (3), the type rules establish
$E; L_1, L_2 \vdash x' = p'_1\ p'_2 : B'$.
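
The binding case can be made concrete with a small sketch. The tuple encoding of patterns, the helper names `lookup`, `fresh_names`, `rename` and `type_of`, and the fresh-name scheme `v0, v1, ...` are all assumptions made for illustration; they are not the paper's implementation, and the input stream $s$ is ignored. The sketch renames a binder to a fresh name, extends the renaming only for the body $p_2$, and checks on one example that the type computed under $L_3$ agrees with the type of the renamed pattern under $L_1, L_2$.

```python
from itertools import count

# Hypothetical pattern encoding (an assumption, not the paper's syntax):
#   ("unit",)            the pattern unit
#   ("var", x)           a reference to the pattern variable x
#   ("seq", p1, p2)      the sequence  p1 p2
#   ("bind", x, p1, p2)  the binding   x = p1 p2


def lookup(env, x):
    """Return the rightmost entry for x in a list of (name, value) pairs."""
    for name, value in reversed(env):
        if name == x:
            return value
    raise KeyError(x)


def fresh_names():
    """Yield v0, v1, ...; assumed not to occur in Dom(L1) or Ran(R)."""
    return (f"v{i}" for i in count())


def rename(p, R, fresh):
    """Rename pattern variables in p according to R, extending R at binders."""
    tag = p[0]
    if tag == "unit":
        return p
    if tag == "var":
        return ("var", lookup(R, p[1]))
    if tag == "seq":
        return ("seq", rename(p[1], R, fresh), rename(p[2], R, fresh))
    if tag == "bind":
        _, x, p1, p2 = p
        x_new = next(fresh)                           # the fresh name x'
        p1_new = rename(p1, R, fresh)                 # x is not in scope in p1
        p2_new = rename(p2, R + [(x, x_new)], fresh)  # R' = R, x <- x'
        return ("bind", x_new, p1_new, p2_new)
    raise ValueError(tag)


def type_of(p, L):
    """Type a pattern under L, following the typing facts used in the proof."""
    tag = p[0]
    if tag == "unit":
        return "Unit"
    if tag == "var":
        return lookup(L, p[1])
    if tag == "seq":
        type_of(p[1], L)        # p1 must be typable; the result is p2's type
        return type_of(p[2], L)
    if tag == "bind":
        _, x, p1, p2 = p
        return type_of(p2, L + [(x, type_of(p1, L))])
    raise ValueError(tag)


# Example: y is already renamed to y1; the binder x receives the fresh name v0.
L1 = []
L3 = [("y", "Var")]
R = [("y", "y1")]
L2 = [("y1", "Var")]
p = ("bind", "x", ("var", "y"), ("seq", ("unit",), ("var", "x")))
p_new = rename(p, R, fresh_names())
```

Here `type_of(p, L3)` and `type_of(p_new, L1 + L2)` agree on this example, which is the invariant the case establishes in general. A single example is of course no substitute for the induction.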

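• $g; L_1; R \vdash (s)\, p_1 \mid p_2 \Rightarrow (s')\, p'_1$ and $g; L_1; R \vdash (s)\, p_1 \Rightarrow (s')\, p'_1$

Suppose $E; L_3 \vdash p_1 \mid p_2 : B$. Because of the type rules, $E; L_3 \vdash p_1 : B$.
Using the induction hypothesis for $g; L_1; R \vdash (s)\, p_1 \Rightarrow (s')\, p'_1$, we obtain
$E; L_1, L_2 \vdash p'_1 : B$.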

• $g; L_1; R \vdash (s)\, p_1 \mid p_2 \Rightarrow (s')\, p'_2$ and $g; L_1; R \vdash (s)\, p_2 \Rightarrow (s')\, p'_2$

Analogous to the previous case.
• $g; L_1; R \vdash (s_0)\, c_{(B_1, \ldots, B_k)B}(p_1, \ldots, p_k) \Rightarrow (s_k)\, c_{(B_1, \ldots, B_k)B}(p'_1, \ldots, p'_k)$

Suppose $E; L_3 \vdash c_{(B_1, \ldots, B_k)B}(p_1, \ldots, p_k) : B$. Because of the type rules,
$E; L_3 \vdash p_j : B_j$ for $1 \leq j \leq k$. Together with
$g; L_1; R \vdash (s_{j-1})\, p_j \Rightarrow (s_j)\, p'_j$, we can apply the induction hypothesis and obtain
$E; L_1, L_2 \vdash p'_j : B_j$ and thereby
$E; L_1, L_2 \vdash c_{(B_1, \ldots, B_k)B}(p'_1, \ldots, p'_k) : B$.

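• $g; L_1; R \vdash (s_0)\, x(p_1, \ldots, p_k) \Rightarrow (s')\, x'_1 = p'_1 \ldots x'_k = p'_k\ p'$

We know that

1. $g; L_1; R \vdash (s_{j-1})\, p_j \Rightarrow (s_j)\, p'_j$ for $1 \leq j \leq k$

2. $R' = R, x_1 \leftarrow x'_1, \ldots, x_k \leftarrow x'_k$, $x'_i \notin \mathrm{Dom}(L_1) \cup \mathrm{Ran}(R)$, $x'_i \neq x'_j$ for $i \neq j$

3. $g; L_1; R' \vdash (s_k)\, p \Rightarrow (s')\, p'$

4. $(x : (x_1{:}B_1, \ldots, x_k{:}B_k)B \mathrel{==} p) \in g$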

Suppose $E; L_3 \vdash x(p_1, \ldots, p_k) : B$. From the type rules it follows that
$E; L_3 \vdash p_j : B_j$ for $1 \leq j \leq k$. We can apply the induction hypothesis to
(1) and obtain $E; L_1, L_2 \vdash p'_j : B_j$. This judgment still holds if we insert
additional fresh identifiers $x'_i \notin \mathrm{Dom}(L_1) \cup \mathrm{Ran}(R)$ into the environment,
that is, $E; L_1, L_2, x'_1 : B_1, \ldots, x'_{j-1} : B_{j-1} \vdash p'_j : B_j$. Assume for now that
$E; L_1, L_2, x'_1 : B_1, \ldots, x'_k : B_k \vdash p' : B$. This allows us to apply the type
rule for pattern variable bindings $k$ times, and we obtain
$E; L_1, L_2 \vdash x'_1 = p'_1 \ldots x'_k = p'_k\ p' : B$.

Now we prove that indeed $E; L_1, L_2, x'_1 : B_1, \ldots, x'_k : B_k \vdash p' : B$. Since
$\vdash g : E$, we have $\emptyset \vdash g :: E$ and $\emptyset \vdash g\ \mathrm{ok}$. Together with (4) this
establishes $E; \emptyset, x_1 : B_1, \ldots, x_k : B_k \vdash p : B$, and hence
$E; L_3, x_1 : B_1, \ldots, x_k : B_k \vdash p : B$. Applying the induction hypothesis to (3) we get
$E; L_1, L_2, x'_1 : B_1, \ldots, x'_k : B_k \vdash p' : B$. □
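
The shape of the result in the non-terminal case, $x'_1 = p'_1 \ldots x'_k = p'_k\ p'$, can be illustrated with a short sketch. The tuple encoding, the dictionary representation of the grammar `g`, the function `expand`, and the fresh names `v0, v1, ...` are hypothetical and chosen only for this illustration. The sketch ignores the input stream and assumes a non-recursive grammar, so it is not a parser.

```python
from itertools import count

# Hypothetical encoding (an assumption, not the paper's syntax):
#   ("unit",), ("var", x), ("seq", p1, p2), ("bind", x, p1, p2)
#   ("app", x, [p1, ..., pk])   the non-terminal application x(p1, ..., pk)
# A grammar g maps a non-terminal name to (parameter names, body pattern).


def lookup(R, x):
    """Return the rightmost renaming for x in a list of (x, x') pairs."""
    for name, value in reversed(R):
        if name == x:
            return value
    raise KeyError(x)


def expand(p, g, R, fresh):
    """Expand non-terminal applications into chains of fresh bindings."""
    tag = p[0]
    if tag == "unit":
        return p
    if tag == "var":
        return ("var", lookup(R, p[1]))
    if tag == "seq":
        return ("seq", expand(p[1], g, R, fresh), expand(p[2], g, R, fresh))
    if tag == "app":
        _, name, args = p
        params, body = g[name]                        # fact (4): definition in g
        new_args = [expand(a, g, R, fresh) for a in args]       # fact (1)
        new_names = [next(fresh) for _ in params]               # fresh x'_i
        R_ext = R + list(zip(params, new_names))                # fact (2): R'
        result = expand(body, g, R_ext, fresh)                  # fact (3): p'
        # Wrap the body as x'_1 = p'_1 ( ... (x'_k = p'_k p') ... ).
        for x_new, a_new in reversed(list(zip(new_names, new_args))):
            result = ("bind", x_new, a_new, result)
        return result
    raise ValueError(tag)


# Example: a non-terminal "pair" with parameters a and b and body "a b".
g = {"pair": (["a", "b"], ("seq", ("var", "a"), ("var", "b")))}
fresh = (f"v{i}" for i in count())
call = ("app", "pair", [("unit",), ("unit",)])
expanded = expand(call, g, [], fresh)
```

In this example the parameters `a` and `b` are replaced by `v0` and `v1`, and each argument pattern is bound to its fresh name outside the renamed body, mirroring the $k$ applications of the binding rule used in the proof.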